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RESEARCH  OVERVIEW 


Y 

This  report  covers  the  period  from  October  1,  1985  through  March  31,  1986. 

The  research  discussed  here  is  described  in  more  detail  in  several  published 
and  unpublished  reports  cited  below. 

The  CAD  frame  Schema  has  received  limited  but  instructive  applications.  Con¬ 
nections  to  remote  simulation  servers  are  transparent  to  the  user.  More  sub¬ 
stantial  uses  are  anticipated  during  the  next  half  year. 

Waveform  bounding  results  have  been  extended  to  ECL  technology,  and  a  variety 
of  results  are  being  applied  to  an  analog  application  in  early  vision.  Many 
of  the  previous  results  have  been  transferred  to  industry  and  are  being  used 
in  commercial  simulators. 


Fundamental  speed  limits  of  collections  of  active  devices  ^th  parasitics  have 
been  found.  These  are  independent  of  how  the  devices  jjre  interconnected.  A 
reliability  simulator  has  been  written,  and  models  are  being  developed  for  it, 
to  describe  three  important  effects,  namely  metal  migration,  hot-electron 
trapping,  and  time-dependent  dielectric  breakdown. 


Work  continues  on  the  fat-tree  architecture,  and  on  the  efficient  realization 
of  various  classes  of  circuits.  New  results  in  graph  bisection  and  routing 
were  made  during  this  period.  It  is  now  known  how  to  imbed  some  classes  of 
communications  networks  in  others,  for  example  binary  trees  in  hypercubes. 
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THE  DESIGN  OF  SCHEMA 


Ve  have  been  making  steady  progress  on  Schema  over  the  past  few  months.  The 
bulk  of  our  energies  have  been  devoted  to  fixing  bugs  and  providing  missing 
functionality  for  designers.  We  expect  this  effort  to  continue  for  another 
four  or  five  months.  It  is  worthwhile  recalling  what  "basic"  functionality 
Schema  is  providing.  It  contains  a  flexible  schematic-entry  system,  capable 
of  dealing  with  multilevel  hierarchical  schematics.  It  has  been  used  for  MOS, 
TTL  and  analog  designs  and  provides  the  canonical  interface  to  the  topological 
structures  used  by  the  analysis  routines. 

A  simulation  package  allows  the  designer  to  use  a  transient  simulator  like 
Spice  to  simulate  a  circuit.  The  simulation  may  be  run  on  the  local  machine 
or  remotely  on  a  VAX.  In  either  case  the  designer's  interface  is  identical 
and  is  completely  graphics  oriented.  This  system  is  currently  being  extended 
to  include  other  types  of  simulators. 

A  simple  layout  system,  along  the  lines  of  DPL/Daedalus,  can  be  used  to  create 
and  edit  masks.  As  in  the  schematic  system,  the  ultimate  description  gener¬ 
ated  by  Schema  is  a  procedure  that  when  run  will  create  the  mask.  This  pro¬ 
vides  the  ultimate  in  parameterizability.  Structures  have  been  set  up  to 
allow  the  system  to  be  used  for  NMOS,  CMOS  and  GaAs  designs. 

All  of  these  tools  and  their  data  structures  reside  in  a  common  address  space, 
so  it  is  easy  for  a  CAD  designer  to  build  programs  that  straddle  different 
design  domains.  Furthermore,  the  data  structures  used  by  Schema  have  common 
interface  specifications,  so  they  can  be  mixed  quite  flexibly. 


The  two  designs  we  have  run  through  Schema  were  a  CMOS  sorter  and  a  TTL  PC 
board  for  gathering  statistics  from  a  Multibus.  Neither  design  was  completed 
using  Schema,  but  at  least  in  the  CMOS  case,  the  designer  was  able  to  complete 
about  95Z  of  the  design  using  Schema.  Both  efforts  gave  us  a  great  deal  of 
information  about  what  is  missing  in  the  system  and  what  performance  improve¬ 
ments  are  needed.  These  efforts  are  under  way  now  and  we  expect  to  try  a  more 
substantial  project  towards  the  end  of  the  spring. 


THE  WAVEFORM  BOUNDING  APPROACH  TO  DELAY  ESTIMATION 


In  the  past  six  months  ve  have  tightened  the  bounds  for  interconnect  delay,  ve 
have  begun  to  resolve  the  modeling  issues  that  arise  when  the  delay  bounds  are 
used  for  timing  analysis  of  ECL  chips,  and  ve  have  initiated  an  analytical 
study  and  preliminary  design  effort  to  determine  the  feasibility  of  doing  ana¬ 
log  or  hybrid  analog/digital  massively  parallel  signal  processing  in  MOS  VLSI 
for  early  vision  applications. 

Dave  Standley  is  doing  the  work  on  bound-tightening  for  his  S.M.  thesis.  He 
has  written  a  very  elegant  derivation  of  the  new  time  constant  Tpj  for  arbi¬ 
trary  trees,  and  his  result  is  now  in  use  in  the  new  line  of  CAD  products  just 
released  by  Tangent  Systems  (2840  San  Tomas  Expressway,  Suite  200,  Santa 
Clara,  CA  95051).  He  has  also  found  a  new  lover  bound  for  the  rising  step 
response  of  RC  trees.  This  research  area  is  becoming  mature,  and  all  the  easy 
results  are  probably  known  now. 

Peter  O'Brien  is  doing  the  work  on  interconnect  delay  in  ECL  for  his  S.M. 
thesis.  The  first  part  consisted  of  extending  the  Penfield  results  to  include 
the  case  of  "leaky"  RC  trees,  i.e.,  trees  with  resistive  paths  to  ground, 
required  for  modeling  bipolar  transistors  as  loads.  This  work  is  complete. 

He  is  now  working  on  the  second  part  of  the  problem,  developing  the  required 
macromodels  for  ECL  logic  gates.  Both  parts  of  this  project  are  being  done  in 
close  collaboration  with  the  Digital  Equipment  Corporation  for  use  in  an  in- 
house  ECL  timing  analyzer  under  development  there. 

The  early  vision  work  began  with  an  attempt  to  apply  our  interconnect  delay 
results  to  a  proposed  analog  early  vision  chip  to  estimate  how  fast  it  might 
run.  Discussions  with  Prof.  Thomasio  Poggio  and  Dr.  Christof  Koch  in  the  MIT 
AI  Lab  have  led  to  a  joint  interest  in  using  the  properties  of  RC  networks, 
realized  by  a  rectangular  grid  of  polysilicon  on  an  MOS  chip,  to  perform  a 
variety  of  early  vision  tasks  quite  rapidly  and  cheaply  by  analog  methods. 

The  simplest  and  most  promising  project  seems  to  be  analog  depth  interpolation 
for  stereo  vision.  The  idea  is  quite  speculative  but  the  potential  improve¬ 
ment  over  current  implementations  such  as  the  connection  machine  is  many 
orders  of  magnitude  in  size  and  cost,  and  one  or  two  orders  of  magnitude  in 
speed. 
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HIGH  PERFORMANCE  CIRCUIT  DESIGN 


RELIC,  our  reliability  simulator,  is  beginning  to  show  signs  o£  life.  The 
objective  of  this  research  effort  is  to  demonstrate  a  simulator  which  predicts 
the  reliability  of  MOS  VLSI  circuits.  More  precisely,  we  would  like  to  pre¬ 
dict  and  compare  the  reliability  of  different  circuits  with  respect  to  those 
failure  mechanisms  under  the  control  of  the  circuit  designer.  In  this  project 
we  are  examining  three  failure  mechanisms:  metal  migration,  hot-  electron 
trapping,  and  time-dependent  dielectric  breakdown  (TDDB).  The  long  term  fail¬ 
ure  statistics  of  the  circuit  are  predicted  based  on  the  "stress"  accumulated 
during  one  cycle,  assuming  that  this  cycle  repeats  forever. 

There  are  three  parts  to  the  simulator:  the  preprocessor,  the  models,  and  the 
post  processor.  Work  has  begun  on  the  first  two  and  we  hope  to  have  all  parts 
running  by  June.  The  models  are  the  central  part  of  the  program.  Ve  have 
added  1986  TDDB  and  metal  migration  models  to  the  underlying  simulator,  RELAX, 
from  Berkeley.  (Our  thanks  to  Berkeley  for  letting  us  have  early  access  to 
their  software.)  The  hot-electron  model  is  still  fighting  us. 

We  are  starting  a  new  research  project  into  the  high  frequency  behavior  of 
VLSI  circuits.  About  20  years  ago  it  was  shown  that  fundamental  frequency 
domain  quantities  such  as  could  be  used  to  predict,  for  instance,  the 

properties  of  ring  oscillators.  We  are  extending  this  classical  theory  in 
three  directions.  First,  we  have  reformulated  it  in  terms  of  Tellegen's  theo¬ 
rem.  Second,  we  are  extending  it  to  much  more  complicated  —  and  realistic  — 
models.  And  third,  we  are  making  some  headway  on  a  nonlinear  formulation  of 
the  problem.  It  is  too  early  to  predict  where  all  of  this  will  go,  but  we 
hope  to  be  able  to  say  things  about  what  sort  of  performance  one  could  possi¬ 
bly  get  out  of  (digital?)  networks  made  up  of  certain  types  of  components.  A 
possible  application  for  this  work  is  cross  comparisons  of  technologies,  where 
one  would  like  to  compare  the  cleverness  of  the  device  engineers  rather  than 
that  of  the  circuit  designers.  Another  possibility  would  be  in  the  area  of 
making  technology  trade-offs  in  the  design  of  CMOS  devices.  For  instance,  is 
it  "better"  to  have  a  larger  lateral  diffusion  with  more  overlap  capacitance 
but  shorter  transistors  or  is  it  better  to  minimize  the  parasitic  capacitance? 
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Professor  Charles  E.  Leiserson  has  extended  his  work  on  fat-trees  and  now  has 
improved  designs  for  area-  and  volume-universal  networks.  One  simple  design 
contains  only  n  small  switches  for  a  network  with  n  processors.  He  and 
Ron  Greenberg  have  been  able  to  provide  good  message-routing  algorithms  for 
this  and  other  universal  networks. 

Bruce  M.  Maggs  and  Professor  Leiserson  have  been  exploring  parallel  algorithms 
for  fat-trees.  The  discovery  of  volume-universal  networks  with  different 
topologies  from  fat-trees  motivated  them  to  abstract  a  model  for  such  networks 
that  would  allow  them  to  design  parallel  algorithms,  but  which  would  not  take 
advantage  of  the  specific  topology  of  the  network.  Their  new  model,  called  a 
distributed  random-access  machine  (DRAM),  allows  the  communication  require¬ 
ments  of  an  algorithm  to  be  measured. 

Joe  Kilian,  Shlomo  Kipnis,  and  Professor  Leiserson  have  been  investigating  the 
power  of  multipin  nets.  Surprisingly,  it  is  possible  to  make  a  barrel  shifter 
of  n  chips,  each  with  0(/n)  pins,  which  can  perform  any  shift  in  1  clock 
tick.  With  2-pin  nets,  n  -  1  pins  per  chip  are  required.  They  can  also 
show  that  0(/n)  pins  suffice  to  realize  any  abelian  group  of  permutations  of 
n  elements,  and  that  0(/(n  log  n))  pins  suffice  to  realize  almost  all  permu¬ 
tations. 

Andrew  Goldberg  has  been  working  on  parallel  algorithms  for  network  flow  prob¬ 
lems.  He  has  discovered  a  new  algorithm  for  maximum  flow  that  should  run  well 
on  a  parallel  machine,  and  in  particular,  on  a  DRAM.  He  and  Bob  Tarjan  of 
Princeton  have  been  able  to  convert  this  parallel  algorithm  into  a  normal 
sequential  algorithm  that  is  faster  than  previously  known  algorithms  for  this 
well-studied  problem. 

Thang  Bui  completed  his  Ph.D.  thesis  on  graph-bisection  algorithms.  Included 
in  the  thesis  is  joint  work  with  Profs.  Leighton  and  Sipser  on  an  algorithm 
which  is  guaranteed  to  find  the  optimal  bisection  for  almost  all  graphs  with 
small  bisection  width.  The  algorithm  is  particularly  well  suited  to  low 
degree  graphs  such  as  those  that  arise  in  VLSI  placement  and  routing  problems. 
Prof.  Leighton  and  Bui  also  discovered  a  method  of  speeding  up  and  improving 
the  performance  of  other  graph  bisection  heuristics  such  as  Kernighan-Lin  and 
simulated  annealing.  Tests  and  analysis  of  the  new  method  are  still  under 
way,  but  it  now  appears  that  the  speed  of  the  heuristics  can  be  cut  in  half 
and  that  the  size  of  the  bisections  produced  can  be  decreased  by  10-25)S. 

These  figures  represent  substantial  savings  for  large  graphs. 

Bonnie  Berger  completed  her  Master's  thesis  on  channel  routing.  Included  in 
the  thesis  is  joint  work  with  Prof.  Leighton  on  the  development  of  an  algo¬ 
rithm  which  can  route  any  2-point  net  channel  routing  problem  using  d  +  0(/d) 
tracks,  where  d  is  the  density  of  the  channel.  The  algorithm  uses  unit  ver¬ 
tical  overlaps  of  wires  and  knock-knees  in  two  layers,  but  does  not  otherwise 
allow  wires  to  overlap.  The  algorithm  is  very  close  to  optimal  since  d 
tracks  are  always  required  for  a  problem  with  density  d.  (Even  when  arbi- 


trary  overlap  is  allowed,  Leighton  has  shown  that  most  interesting  problems 
still  require  d  layers.)  Moreover,  the  algorithm  uses  only  half  as  many 
tracks  as  the  best  previously  known  algorithm,  which  was  discovered  by  Rivest, 
Baratz  and  Miller.  For  multipoint  net  problems,  the  algorithm  uses  2d  +  0(/d) 
tracks,  which  is  also  half  as  many  as  the  best  previously  known  algorithms. 
Professor  Leighton  is  currently  working  with  Berger  and  others  to  extend  the 
algorithm  to  handle  multilayer  channel-routing  problems.  The  initial  results 
appear  quite  favorable.  They  have  shown  that  any  L-layer  problem  can  be 
routed  in  d/L  +  0(/d)  tracks  which  is  very  close  to  the  lower  bound  of  d/L 
tracks.  For  multipoint-net  problems,  the  bound  is  d/(L-l)  +  0(/d). 

In  the  area  of  wafer-scale  integration,  Prof.  Leighton  and  Peter  Shor  have 
shown  how  to  use  matching  algorithms  to  construct  two-dimensional  systolic 
arrays  from  the  working  cells  on  an  N-cell  wafer  using  wires  of  length 
O(log^'^N).  The  result  assumes  that  the  faulty  cells  are  located  at  random  on 
the  wafer,  and  solves  a  long-open  question  in  mathematical  statistics.  The 
work  will  be  reported  in  the  upcoming  ACM  Symposium  on  Theory  of  Computing  in 
San  Francisco. 

Prof.  Leighton  has  also  made  advances  on  several  network-embedding  problems. 
Working  with  others,  he  has  shown  how  to  efficiently  embed  arbitrary  binary 
trees  in  the  hypercube  and  other  networks.  Although  the  results  are  still 
preliminary,  this  work  will  probably  include  the  discovery  of  the  first 
bounded-degree  network  known  to  contain  all  binary  spanning  trees  as 
subgraphs.  Professor  Leighton  is  also  studying  related  graph-embedding 
problems  with  the  aim  of  using  one  network  (e.g.,  a  hypercube)  to  efficiently 
simulate  another  network  (e.g.,  a  mesh). 
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ABSTRACT 


As  MOS  technologies  advance,  the  relative  differences  between  on-chip  and 
off-chip  delays  increase.  Drivers  and  receivers  can  be  designed  which  allow 
high  bit  rate  communications  (>100  Mbits/sec)  between  MOS  chips  at  the  expense 
of  increased  latency.  Designing  synchronous  systems  which  couple  a  high  clock 
frequency  with  large  and  variable  delays  is  difficult  and  expensive  due  to  the 
complexity  of  insuring  that  no  delays  violate  the  constraints  imposed  by 
synchronous  operation. 

A  circuit-based  technique  for  automatically  adjusting  signal  delays  in  an 
MOS  system  has  been  developed.  The  Dynamic  Delay  Adjustment  (DDA)  technique 
provides  reliable  high  speed  communications  directly  between  MOS  chips 
independent  of  the  delay  between  the  chips.  The  amount  of  phase  jitter  immunity 
provided  by  the  synchronizer  can  be  traded  off  against  circuit  complexity;  the 
signal  delays  are  adjusted  continuously  to  track  temperature  induced  delay 
variations.  A  3  micron  CMOS  DDA  synchronizer  has  been  fabricated  to  confirm  the 
validity  of  the  DDA  approach;  test  results  will  be  presented. 
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part  by  the  Defense  Advanced  Research  Projects  Agency  under  Contract  number 
N00014-80-C-0622. 

••Current  address;  Bolt,  Beranek  and  Newman,  Room  6/501,  10  Molton  St., 
Cambridge,  MA  02238;  (617)  497-2471. 

Copyright  (c)  1985,  MIT.  Memos  in  this  series  are  for  use  inside  MIT  and 
are  not  considered  to  be  published  merely  by  virtue  of  appearing  in  this  series. 
This  copy  is  for  private  circulation  only  and  may  not  be  further  copied  or 
distributed.  References  to  this  work  should  be  either  to  the  published  version, 
if  any,  or  in  the  form  "private  communication."  For  information  about  the  ideas 
expressed  herein,  contact  the  author  directly.  For  information  about  this 
series,  contact  Microsystems  Research  Center,  Room  39-321,  MIT,  'Cambridge,  MA 
02139;  (617)  253-8138. 


DEPARTMENT  OF  ELECTRICAL  ESGINEERISG  ASD  COMPUTER  SCIENCE 


A'' - 

!  t  !;  MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 


CAMBRIDGE 


MASSACHUSETTS  0  2  13  9 


VLSI  Memo  No.  85-268 


October  1985 


Interproeesaor  Communication  Issues  in  Fat-Tree  Architectures* 


Alexander  Toichi  Ishii** 


ABSTRACT 


In  recent  years,  it  has  become  increasingly  evident  that  conventional 
computer  architectures  will  be  unable  to  perform,  in  an  acceptable  time  frame, 
many  of  the  computational  functions  that  we  would  desire  of  them.  Consequently, 
much  research  has  been  devoted  to  the  concept  of  constructing  super-computers, 
which  will  be  able  to  exploit  the  potential  for  parallel  computation  intrinsic 
to  many  large  computational  problems. 

Recently,  Leiserson  has  proposed  a  multiprocessor  scheme  based  on 
Leighton's  "tree  of  meshes,"  called  a  "fat-tree."  Conceptually,  such  a 
multiprocessor  would  be  comprised  of  a  set  of  n  processing  elements  each 
situated  as  a  leaf  in  a  complete  binary  tree.  Internal  nodes  would  be  high 
speed  switches  which  route  messages  being  passed  between  processing  elements, 
while  edges  between  nodes  would  be  bundles  of  constant  bandwidth  communication 
paths. 


The  purpose  of  this  document  will  be  to  address  and  define  some  of  the 
Issues  which  effect  Interprocessor  communication  within  a  fat- tree 
multiprocessor.  Specifically,  we  will  cover  the  following  topics: 

1 .  Addressing  in  a  fat-tree 

2.  Generation  of  addresses  in  a  fat-tree 

3 .  Allocation  of  communication  resources  in  a  fat-tree 
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Comonting  Minimtan  Spanning  Trees 
On  a  Fat-Tree  Architecture* 


Bruce  M.  Maggs** 


AB  STB  ACT 


This  paper  presents  two  algorithms  for  computing  the  minimum  spanning 
forest  of  an  input  graph  on  a  fat-tree  architecture.  One  algorithm  is 
deterministic,  and  the  other  probabilistic.  The  deterministic  algorithm 
generates  O(log^lvl)  message  sets,  each  of  which  can  be  delivered  in  0(p(G)) 
delivery  cycles.  The  probabilistic  algorithm  generates  O(log^|v|)  message  sets, 
each  of  which  can  be  delivered  in  0(p(G))  delivery  cycles. 
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Space-Efficient  Algorithms  for  Computational  Geometry* 
Cynthia  A.  Phillips** 


ABSTRACT 

This  thesis  presents  an  algorithm  for  determining  the  connectivity  of  a  set 
of  N  rectangles  In  the  plemei  a  problem  central  to  avoiding  aliasing  In  VLSI 
design  rule  checkers.  Previous  algorithms  for  this  problem  either  worked  slowly 
with  a  small  amount  of  primary  memory  space,  or  worked  quickly  but  used  more 
space.  The  eilgorithm  presented  here,  based  upon  a  technique  called  scanning, 
operates  in  0(N  Ig  N)  time  in  the  worst  case.  This  matches  the  running  time  of 
the  best  known  sequential  algorithm  for  this  problem.  Because  we  use  a  machine 
model  that  explicitly  incorporates  secondary  memory,  the  new  connected 
components  algorithm  avoids  unexpected  disk  thrashing  which  leads  to  lower 
performance.  The  algorithm  uses  0(W)  primary  memory  space,  where  W,  the  scan 
width,  is  the  maximum  number  of  rectangles  to  cross  any  vertical  cut.  It 
requires  no  more  than  0(N)  transfers  between  primary  and  secondaiTy  memory. 

When  a  vertical  line  passes  through  a  set  of  rectangles,  those  rectangles 
cut  by  the  line  form  a  set  of  line  segments.  The  key  to  development  of  space- 
efficient  algorithms  using  a  two  layer  memory  model  is  that  appropriate 
manipulations  of  these  segments  alone  can  solve  more  complicated  problems  such 
as  the  connected  components  problem.  This  thesis  introduces  interval  trees,  a 
simple,  sparse,  data  structure  for  storing  a  set  of  k  line  segments.  With  this 
data  structure,  a  variation  on  a  balanced  search  tree,  one  can  perform  each  of 
the  following  operations  in  0(lg  k)  time  in  the  worst  case:  1)  insert  a  new 
segment,  2)  delete  a  segment,  and  3)  given  a  test  interval,  return  a  segment 
which  intersects  that  test  intervaiL  or  return  nil  if  there  is  no  such  segment. 
This  data  structure  is  used  in  the  new  connected  components  adgorithm.  It  can 
also  be  used  to  improve  other  existing  algorithms  for  computational  geometry 
problems. 
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ABSTRACT 


Fat-trees  are  a  class  of  routing  networks  for  hardware-efficient  parallel 
computation.  This  paper  presents  a  randomized  algorithm  for  routing  messages  on 
a  fat-tree.  The  quality  of  the  algorithm  is  measured  in  terms  of  the  load 
factor  of  a  set  of  messages  to  be  routed,  which  is  a  lower  bound  on  the  time 
required  to  deliver  the  messages.  We  show  that  if  a  set  of  messages  has  load 
factor  X  =  Q(lg  n  Ig  Ig  n)  on  a  fat-tree  with  n  processors,  the  number  of 
delivery  cycles  (routing  attempts)  that  the  algorithm  requires  is  0(X)  with 
probability  l-0(l/n).  The  best  previous  bound  was  0(Xlg  n)  for  the  off-line 
problem  where  switch  settings  can  be  determined  in  advance.  In  a  VLSI-like 
model  where  hardware  cost  is  equated  with  physical  volume,  we  use  the  routing 
algorithm  to  demonstrate  that  fat-trees  are  universal  routing  networks  in  the 
sense  that  ivy  routing  network  can  be  efficiently  simulated  by  a  fat-tree  of 
comparable  hardware  cost. 
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ABSTRACT 


This  paper  documents  the  procedure  for  designing  a  self-timed  controller 
for  a  successive-approximation  A/D  converter.  From  the  functional 
specification,  a  Signal  Transition  Graph  Is  constructed  to  describe  the 
operation  of  the  control  circuit.  This  graph  Is  then  modified  into  a  well- 
formed  graph.  Such  a  graph  can  be  transformed  into  a  dead lock- free  and  hazard- 
free  implementation  directly.  The  structure  of  the  control  circuit  and  the 
logic  equations  are  then  derived  directly  from  the  graph. 
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Abstract 

Tliis  paper  <locituu*uts  the  prorc<lurc  for  (lc*signing  a  soif'timcd 
controller  for  a  succesaive-appnwiinatiou  A/D  converter.  From  the 
fuuctioimi  specification,  a  Sigiuol  Transition  Graph  is  con.stnicted  to 
describe  the  operation  of  the  control  cirniit.  This  graph  is  then  modi¬ 
fied  uito  a  well-formed  graph.  Such  a  graph  can  bo  transformtHl  into  a 
deadlock- free  and  hazard-free  buplcnientation  directly.  The  structure 
of  the  control  circuit  and  the  logic  equations  arc  then  derived  directly 
from  the  graph. 

1.  Introduction 

This  paper  describes  the  design  of  a  control  circuit  for  a  successive* 
approximation  A/D  converter.  This  controller  is  an  asynchronous  self- 
timed  circuit  in  which  all  control  actions  are  carried  out  through  the  use  of 
the  Rcquest/Acknowlcdge  signaling  protocol.  From  a  functional  specifica¬ 
tion  of  the  control  circuit,  a  Signal  Transition  Graph.  (STG)  is  constructed 
to  describe  the  behavior  of  the  circuit.  It  is  then  modified  into  a  well-formed 
graph,  which  is  one  satisfying  the  liveness  and  persistency  properties.  This 

^Thia  ri'avarrh  waa  aiipportcd  in  part  by  a  Hnghca  FcUowahip  and  the  NSF  under  grant 
niuubcr  MCS -7915255. 

roanirrh  waa  aiipportcd  by  the  Dcfciiae  Advanced  Rcacarch  Project  Agency  under 
contract  niuubcr  NOOOI4-80-C-OC22. 


1 


r 


STC  can  be  converlcd  directly  into  a  logic  circuit  through  a  nuuibcr  of  syn¬ 
thesis  steps.  A  theory  of  the  Signal  Transition  Graph  model  is  discussed 
in  [l].  SulHcient  details  about  the  model  will  be  given  here  to  explain  the 
synthesis  process  from  STG. 

The  main  tvlvantagc  of  this  realization  method  is  that  it  can  produce 
a  circuit  directly  from  a  well-formed  speciGcation,  and  the  circuit  is  speed- 
independent,  i.e.,  it  operates  correctly  with  any  combination  of  delays  of 
logic  gates.  This  feature  is  iniportaul  for  VLSI  applications,  as  it  is  in- 
efficient  and  not  always  possible  to  fine-tune  the  delays  of  logic  gates  on 
chip  to  make  an  asynchronous  system  work.  The  STG  model  allows  the 
specification  of  concurrency,  <ind  hence  the  control  logic  synthesized  from 
this  model  supports  concurrent  operations.  The  traditional  approach  for 
designing  asynchronous  stcate  machines  can  only  model  sequential  control 
actions,  and  furthermore  they  are  diflicult  to  realize  because  of  problems 
due  to  races  and  hazards.  In  contrast,  the  approach  presented  here  pro¬ 
duces  hazard-free  control  circuits  capable  of  handling  parallel  operations. 

2.  Behavior  Specification  of  the  Controller 

The  block  diagram  of  the  successive  approximation  A/D  converter  is  shown 
in  Figure  1.  The  input  comparator  compares  the  input  voltage  Vi„  and  the 
reference  voltage  Vr^/  and  produces  a  digital  1-bit  result.  The  comparator 
has  a  control  input  Zr  which  zeroes  it  when  Zr  makes  a  low- to- high  transi¬ 
tion,  and  starts  the  comparison  when  Zr  makes  a  high-to-low  transition.  It 
also  has  a  mutual-exclusion  circuit  whose  output  is  active  (=1)  only  when 
the  comparator  output  is  valid.  This  is  required  because  the  comparison 
time  is  a  function  of  the  difference  between  the  input  voltage  and  the  ref¬ 
erence  voltage;  the  smaller  the  difference,  the  longer  the  time  it  takes  for 
the  comparator  to  decide.  This  is  the  familiar  problem  of  metastability [2]. 

The  latch  and  the  combinational  logic  form  a  finite  state  machine  per¬ 
forming  the  successive  approximation  algorithm.  Note  that  this  machine 
operates  in  pulse  mode  and  is  not  the  same  as  the  self-tuned  controller 
we  are  synthesizing.  Due  to  the  fact  that  this  machine  performs  many 
data-dependent  operations,  it  is  more  economical  and  straight-forward  to 
implement  it  in  this  form  instead  of  a  Huffman  asynchronous  state  machine. 
Data  are  latched  on  the  rising  transition  of  signal  Lr  and  held  in  registers 
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Figure  1:  Block  diagram  of  the  successive-approximation  A/D  converter 

after  L,  goes  low.  Signal  L„  goes  high  as  soon  as  data  are  latched,  and  goes 
low  shortly  after  i,  goes  low.  The  reset  input  of  the  latch  is  controlled  by 
signal  Req,  so  that  when  Req  is  low,  its  outputs  are  reset  to  the  appro¬ 
priate  initial  state.  Signal  LB  is  the  Last'Bii  signal  which  goes  high  when 
the  converter  has  determined  the  last  bit  of  the  digital  word.  The  D/A 
converter  at  the  right  of  the  diagram  accepts  the  digital  word  produced 
by  the  state  machine  and  generates  the  analog  voltage  Vre/.  The  combined 
delay  of  the  combinational  logic  and  the  D/A  converter  is  matched  by  some 
delay  circuit  as  indicated  by  a  drished  line  from  to  Da-  While  it  is  pos¬ 
sible  to  accomplish  this  timing  constraint  in  a  speed-independent  manner 
using  dual  rail  coding,  a  simple  delay  circuit  is  more  justifiable  from  an 
engineering  standpoint. 

Initially,  the  state  of  the  system  is  Req  —  Ack  =  2,  =  Zo  =  £,  =  £*  =  0 
and  Dr  =  Da  —  1.  Since  Req  =  0,  the  latch  is  initialized  with  LB  =  0. 
Thus,  the  and-gate  whose  input  is  Req  is  enabled  and  the  and-gate  whose 
output  is  Ack  is  disabled.  When  Req  is  raised,  2,  will  go  high  and  initiate  a 
cycle  of  the  successive-approximation  algorithm.  After  each  cycle.  Da  will 
restart  another  cycle  until  LB  becomes  high  during  the  last  cycle  of  the 


nlgorilhin.  In  the  hist  cycle,  Ar.k  is  raised  when  l),^  goes  high.  After  that, 
Ileq  drops,  resetting  Lli  and  in  turns  Ack  to  low.  At  this  point  the  circuit 
returns  to  its  initial  state,  ready  for  the  next  conversion. 

3.  Constructing  a  STG  from  the  Specification 

A  STG  describing  the  operation  of  the  sclf-tinicd  controller  is  shown  in 
Figure  2.  In  this  STG,  vertices  represent  control  events  corresponding 
directly  to  rising  and  falling  transitions  of  signals  in  the  controller;  directed 
arcs  between  transitions  are  timing  precedence  constraints.  Transitions  of 
input  signals  to  the  circuit  arc  underlined  to  distinguish  them  from  those 
of  non-input  signals.  Preccndencc  constraints  of  the  former  are  given  by 
the  specification  and  are  assumed  to  be  satisfied  by  the  outside  world, 
whereas  precedence  constraints  of  the  latter  are  generated  and  satisfied 
by  the  circuit  obtained  from  the  STG.  The  dashed  arcs  in  this  figure  are 
not  derived  from  the  behavior  specification  of  the  circuit,  but  are  extra 
constraints  to  mjike  the  STG  persistent^  as  will  be  explained  later.  The 
meaning  of  the  transitions  are  also  described  in  this  figure.  The  notation 

aH - ►  6—  indicates  that  the  rising  of  signal  a  has  to  precede  the  falling 

of  b,  and  that  the  transition  b—  is  directly  caused  by  a-F.  The  notation 
a+  —*p  b—  indicates  that  the  occurrence  of  o+  precedes  that  of  b—  but 
it  may  or  may  not  directly  precede  6—.  The  And-fork  relation  aRji{b,c) 
means  that  a  —*  b  and  a  — »  c,  and  that  the  occurrence  of  a  will  cause  the 
concurrent  occurrences  of  b  and  c.  The  And-join  relation  (a,  means 

that  a  — »  c  and  6  — »  c,  and  that  c  occurs  only  after  both  a  and  6  have 
occurred. 

There  are  two  fundamental  properties  of  STG  concerning  the  synthesis 
of  hazard-free  and  deadlock-free  control  circuits,  those  of  liveness  jmd  per¬ 
sistency.  A  STG  satisfying  these  properties  is  well-fonned  and  possesses  a 
hazard-free  and  deadlock-free  realization.  If  a  STG  is  strongly  connected 
and  for  every  transition  of  signal  t,  there  exist  at  le:ist  a  directed  path  from 

to  f  —  and  at  least  one  from  t—  to  t-F,  then  the  corresponding  circuit  real¬ 
ization  is  live,  i.e.,  free  from  deadlocks.  The  second  property  is  persistency, 
which  states  that  if  a  signal  is  enabled  in  some  state  of  the  system,  only 
a  transition  of  that  signal  can  bring  the  system  to  another  state  in  which 
it  is  no  longer  enabled.  In  terms  of  STG  notations.  Fig.  3a  illustrates  an 


Figure  2:  The  STG  description  of  the  self-timed  controller 

example  of  a  violation  of  persistency:  transition  a-f  causes  transition  6-+-, 
and  also,  transition  o—  is  concurrent  with  b+.  Thus,  while  i  is  going  high, 
transition  a—  may  occur  and  remove  the  enabling  condition  of  transition 
6-1-,  resulting  in  a  hazard  of  signal  6.  Fig.  3b  shows  that  an  additional 
constraint  6-i — ►p  a—  is  required  to  satisfy  the  persistency  property. 

The  STG  in  Fig.  2  is  a  strongly  connected  digraph  which  satisfies  the 
liveness  property,  therefore  it  corresponding  circuit  is  live.  The  transitions 
Dr—,Da—  and  La—  are  merely  reset  transitions  of  the  reset  signaling  proto¬ 
col.  Two  constraints  in  this  graph  deserve  careful  attention:  the  constraint 
D„-^  —*  Zr—  is  required  because  a  comparison  is  not  allowed  to  begin  until 

after  a  new  value  of  Vre/  is  available;  the  other  constraint  L, - ►  Zr+  makes 

sure  that  the  gating  signal  of  the  latch  is  turned  off  before  the  comparator 
changes  its  output  value. 


4.  Making  the  STG  persistent 

In  general,  a  STG  constructed  from  the  behavior  specification  of  a  control 
circuit  alone  may  not  be  realizable.  In  order  to  have  a  deadlock-free  and 


Figure  3:  (a)  A  violation  of  persistency,  (b)  elimin.ation  of  the  violation 


hazard-free  realization,  it  must  satisfy  two  fundamental  properties  stated 
above.  In  the  original  specification  in  Figure  2  (without  the  dashed  ar.es), 
one  can  immediately  detect  two  cases  of  violation  of  persistency.  The  fork 
La  +  ZA{Dr-h,Lr-)  indicates  that  La+  causes  Dr-f-,  but  transition  - 
which  directly  follows  L,—  is  concurrent  with  Dr+.  Thus,  while  Dr+  is 
occurring.  La—  may  occur  and  remove  the  enabling  condition  of  transition 

The  dashed  arc  D,-| - ►  Lr—  eliminates  this  problem  by  adding  the 

constraint  Dr+  —*p  La—  to  the  graph.  In  this  case  a  direct  constraint 

Dr-\ - *  La—  is  not  allowed  because  La—  is  a  transition  of  an  input  node  of 

the  corresponding  circuit;  constraints  to  input  transitions  are  given  by  the 
specification  and  there  can  be  exactly  one  arc  incident  to  an.  input  transi¬ 
tion.  Similarly,  the  violation  at  the  fork  Da  +  ZAiDr—,Zr—)  is  eliminated 

by  adding  the  dashed  arc  Zr - ►  D,.—  to  the  graph.  After  adding  these  two 

constraints  to  eliminate  violations  of  persistency,  one  can  further  eliminate 

constraints  La-\ - *  Lf-  and  Da-\ - -  Dr—  Jis  they  become  redundant.  At 

this  stage,  an  STG  as  shown  in  Figure  4a  is  obtained. 

It  turns  out  that  this  modified  graph  is  still  not  implementable  due  to 
the  lack  of  internal  state  information.  This  is  a  subtle  case  of  violation  of 
persistency  which  will  be  discussed  later.  An  internal  state  signal  called  x  is 
introduced  and  the  transitions  x-\-,x—  are  added  as  shown  in  Figure  4b.  In 
order  to  understand  how  this  node  is  introduced  into  the  STG  specification, 
the  synthesis  process  from  STG  needs  be  discussed  before  this  issue  can  be 
explored  (Section  6.) 


5.  Synthesis  of  Circuit  from  STG 


In  the  circuit  realization  of  this  graph,  nodes  La,Da,Za  are  input  nodes  to 
the  circuit,  other  nodes  are  non-input  ones  whose  logic  equations  have  to  be 


Figure  4:  (a)  The  STG  with  two  violations  of  persistency  eliminated,  and 
(b)  the  STG  with  the  addition  of  the  internal  state  i. 

determined.  The  set  of  non-input  nodes  is  {Dr-,Lr,Zr,x).  First,  the  STG 
in  Fig.  4b  is  decomposed  into  number  of  reduced  graphs,  as  described  below. 
For  each  non-input  node  '  t,  in  set  {Dr,Lr,Zr,x),  we  find  its  input  set  I{ti) 
defined  as  the  set  of  nodes  whose  transitions  directly  precede  transition  (<, 
i.e., 

I[U)  =  {tj  I  tj  ti}. 

Thus,  in  the  STG  m  Fig.  4b,  I{Lr)  =  {Dr,Za,x},  I{Zr)  =  {Lr,Da,x}, 
/(i)  =  {Za,Da}  and  I{Dr)  =  The  last  insput  set  I{Dr)  is  a 

special  case  because  even  though  node  i  does  not  directly  precede  £?,,  it  is 
included  in  I  {Dr)  to  avoid  a  violation  of  persistency,  as  will  be  discussed  in 
Section  6.  A  reduced  graph  Gii{ti)  of  node  U  is  then  obtained  by  removing 
all  transitions  in  the  STG  that  do  not  belong  to  the  set  I{t,)  U  {t,},  keeping 
all  precedence  constraints  intact.  The  reduced  gr.aphs  G/i(Dr),  Gn(Zr), 
G/i(Lr)  and  G/j(i)  are  shown  in  Fig.  5.  A  reduced  graph  for  a  non-Lnpiit 
node,  e.g.  Zr,  contains  transitions  of  nodes  in  the  set  I{Zr)U  (Zr),  and  the 
logic  element  that  realues  node  Zr  has  one  output  terminal  being  Zr  and 
input  terminals  being  those  in  the  set  I{Zr).  This  graph  contains  all  the 


(u)  HcJucrU  anipli  for  /.r 


(c)  Ri  ducvd  srtipit  for  Dr 


(J)  Reduced  fintph  for  x 


Figxire  5:  Reduced  graphs  for  non-input  nodes  in  the  set  {Zr,  Lr,  Dr,x} 

infonnation  about  the  timing  behavior  of  logic  element  Zr  as  speciSed  by 
the  precedence  relations  between  its  input  and  output  terminals.  A  logic 
equation  can  be  determined  from  this  specification  as  shown  next. 

The  final  step  in  the  synthesis  process  is  to  derive  logic  equations  from 
reduced  graphs.  This  step  is  illustrated  for  nodes  Zr  and  z.  The  equations 
for  Dr  and  £,  will  be  given  later  and  the  readers  can  check  them,  using  the 
procedure  described. 

From  the  reduced  graph  Gu(Zr),  one  can  derive  a  state  graph  (Fig.  6a) 
in  which  a  state  is  a  binary  vector  representing  the  state  of  terminals  of  logic 
element  Zr.  This  state  is  LrDaxZr.  The  transition  from  one  state  to  another 
involves  a  single  variable  change,  and  it  corresponds  to  a  signal- transition 
in  the  reduced  graph  Ga{Zr)  of  Fig.  5a.  For  example,  the  state-transition 
0101  — »  0111  in  Fig.  6a  is  caused  by  signal  trcinsition  r-f  in  Gii[Zr).  The 
concurrent  transitions  of  and  Lr—  is  described  by  a  2-cube  of  states, 
containing  2  possible  sequence  of  state  changes  1000  — ♦  1100  — ►  0100  and 
1000  — ►  0000  — *  0100.  The  first  state  sequence  takes  places  if  Z?a+  occurs 
before  Lr  —  ,  the  second  takes  place  if  Lr—  occurs  before  Da+.  Since  the 
circuit  behaves  exactly  the  same  no  matter  which  transition  occurs  first,  it 
is  clear  that  these  two  cases  also  cover  the  case  when  Lr—  and  Da+  occur 
at  exactly  the  same  time.  This  is  how  concurrency  can  be  described  in  a 
state-based  fonnulation. 
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(a)  State  for  V.r 
derived  from  CiR('/.r) 


Figure  6:  Steps  in  the  transformation  from  a  reduced  graph  to  the  logic 
equation  for  node  Zr. 

An  output-conflict  exists  if  a  state  has  at  least  two  next-states  in  which 
values  of  the  output  variable  are  different.  In  this  state  graph,  state- 1000 
has  two  next-states  1100  and  0000,  and  they  both  contain  Zr  =  0,  thus 
there  is  no  output-con.flict.  This  state  graph  can  be  programmed  into  a 
type  of  K-map  called  transition  map  shown  in  Fig.  6b.  Each  entry  in  this 
map  corresponds  to  a  binary  representation  of  state  £-,Z?aiZ,;  arcs  between 
entries  are  walks  between  adjacent  neighbors  and  they  are  state  transitions 
given  by  the  state  graph.  In  order  to  transform  a  transition-map  into  a  K- 
map,  each. entry  is  replaced  by  its  next-state  value  of  Zr.  For  example,  in 
state  0111,  the  next-state  value  of  Zr  is  0,  thus  this  entry  in  the  transition 
map  is  replaced  by  a  0.  If  there  are  more  than  one  next-state  and  their 
values  of  Zr  are  different,  i.e.,  if  there  is  an  output-conflict,  it  may  not 
be  possible  to  determine  the  value  of  that  entry  in  the  K-map.  The  logic 
equation  of  Zr  can  be  found  from  this  K-map  to  be 

Zr  =  LrD^X. 

The  derivation  of  the  logic  equation  for  node  x  is  more  interesting  because 
this  logic  element  not  only  has  state  information  but  also  output-conflicts. 
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(b)  Tmnxilioit  map /or  x 


(c)  K- map  for  X 


(a)  State  araphfor  x 
JervinJ /ram  (ili(x) 


Figure  7:  Steps  in  the  transformation  from  a  reduced  graph  to  the  logic 
equation  for  node  x. 

The  state  diagram  derived  from  the  reduced  graph  Grt(i)  is  shown  in  Fig. 
7a.  The  concurrent  transitions  of  Za—  «ind  the  chain  Da - *■  x—  are  de¬ 

scribed  by  a  structure  consisting  of  two  2-cube3,  with  three  possible  allowed 
state  sequences  {( 1 1 1 , 0 1 1 , 00 1 , 000) ,  ( 1 1 1 , 10 1 , 00 1 , 000) ,  ( 1 1 1 , 10 1 , 100 , 000) } . 
The  output  conflict  exists  in  state  ZaDaX  =  101,  as  x  are  different  in  the 
next-states  001  and  100.  However,  this  output-conflict  is  caused  by  concur¬ 
rent  transitions  of  an  input  signal  Za  and  the  output  i.  In  determining  the 
value  of  X  for  the  K-map  in  state  101,  transition  due  to  x—  leading  to  state 
100  must  be  chosen  over  transition  101  — ♦  001  because  the  latter  is  caused 
by  input  transition  Za—.  The  state  graph  in  Fig.  7a  shows  that  regardless 
of  whether  one  is  in  state  101  or  001,  transition  x—  will  always  occur  next 
and  the  circuit  behaves  exactly  the  same.  The  transition  map  in  Fig.  7b 
doesnot  contain  the  transition  101  — *  001.  The  K-map  derived  from  this 
state  graph  is  shown  in  Fig.  7c,  the  logic  equation  is  found  to  be 

I  =  ZaDa  +  xZ?o. 

This  equation  has  the  general  form  i  =  5  -t-  xR  with  5  =  ZaDa  and 
R  =  Da,  its  implementation  is  a  set-reset  flipflop  whose  output  is  x,  the 
set  and  reset  inputs  are  Z„Da  and  Da,  respectively.  Note  that  in  this 
implementation,  it  is  required  that  S.R  =  0  at  all  time. 

Similarly,  the  same  procedure  can  be  applied  to  other  reduced  graphs 
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Figure  8:  The  Gnal  circuit  realization  of  the  self-timed  controller. 


to  obtain  the  logic  equation  for  L,  and  Dr.  They  are  Lr  =  DrxZa  and 
Dr  =  Zr  +  La  +  DrX.  Thc  equation  for  Dr  can  be  rewritten  as  Dr  =  S  +  DrR 
with  S  =  Zr  +  La  JUid  iZ  =  I,  and  it  ia  implemented  as  a  set-reset  flipflop. 
The  reduced  graph  of  Dr  in  Fig.  5c  shows  that  there  is  a  time  period  during 
which  both  Zr  and  z  are  high,  causing  both  the  set  and  reset  inputs  of  the 
Dr  flipflop  to  be  active.  However,  it  also  indicates  that  output  Dr  ia  not  to 
be  reset  until  after  both  Zr  and  La  go  low,  and  therefore,  until  after  the  set 
input  goes  low.  Thc  implementation  of  this  flipflop  is  a  a et- dominant  one, 
as  indicated  by  an  asterisk  in  Fig  8.  On  the  other  hand,  one  can  choose  to 
implement  Dr  directly  from  the  equation  given  above  instead  of  a  set-reset 
flipflop  and  not  worrying  about  this  particular  detail. 

Finally,  by  putting  all  these  elements  together,  one  obtains  the  control 
circuit  for  the  A/D  converter  as  shown  in  the  dashed  box  of  Fig.  8.  The 
readers  should  be  able  to  verify  that  the  self-timed  control  circuit  shown 
is  speed-independent,  i.e.,  it  operates  correctly  with  any  combination  of 
delays  of  logic  gates,  assuming  that  the  internal  feedback  delays  of  flipflops 
are  negligible  compared  to  other  loop  delays  in  the  control  circuit. 


6.  Introducing  Internal  Node  x 

The  synthesis  procedure  of  the  controller  from  a  welt-formed  STG  spec¬ 
ification  have  been  described.  We  now  return  to  the  original  STG  specifi¬ 
cation  in  Fig.  4a  and  explain  how  the  internal  node  x  is  inserted  into  this 
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graph  to  give  the  well- formed  specification  in  Fig.  db.  The  STG  in  Fig.  4a 
has  non-input  nodes  n.nd  their  input  sets  are 

nDr)^{L,„Zr} 

I{Lr)  =  {Dr,D,,Z,} 

I{Zr)^{L,,D,,Z,}. 

Their  reduced  graphs  derived  from  the  above  STG  are  shown  in  Fig.  9.  The 
state  graphs  of  Gii{Dr)  and  G {i{Lr)  are  Figs.  9d  and  e,  respectively.  The 
reduced  graph  of  Dr  (Fig.  9a)  shows  that  there  is  an  instance  of  consecutive 
transitions  of  node  Zr  which  directly  precedes  the  output  transition  Dr 
{Zr+  —*  Zr—  —*  Dr  —  ).  Generally,  if  an  input  signal  to  a  logic  element 
changes  twice  without  any  intervening  transition  which  alters  the  state  of 
the  system,  a  hazard  may  result  at  the  output  of  that  logic  element.  The 
state  graph  of  Gn{Dr)  (Fig.  9d)  in  which  each  state  is  of  form  LaZrDr 
contains  two  instances  of  each  of  states  101  and  001.  In  state  101,  the 
output  Dr  is  not  enabled  and  Dr  does  not  make  a  transition  from  this 
state.  However,  state  001  has  an  output-conflict  in  the  output  variable  Dr 
because  the  next-states  of  001  are  011  and  000.  This  output-conflict  causes 
hazard  because  whenever  the  logic  element  Dr  gets  into  the  upper  state  001 
both  transitions  Dr—  and  Zr+  are  enabled.  If  Zr+  occurs  first,  the  state- 
sequence  001  — »  Oil  — ♦  001  — ►  000  will  take  place  and  logic  element  Dr 
behaves  correctly.  However,  if  the  delay  of  logic  element  Dr  is  smaller  than 
that  of  Zr,  then  transition  Dr—  will  occur  first  in  state  001,  even  before 
the  occurrence  of  the  chain  Zr-\ — *  Zr—.  This  is  a  malfunction.  In  the  case 
when  both  Zr  and  Dr  have  approximately  the  same  delay,  both  transitions 
Dr—  and  Zr+  can  occur  simultaneously,  resulting  in  a  hazard  at  node  Dr. 

The  reduced  graph  of  Lr  (Fig.  9b)  contains  a  similar  chain  of  transitions 
Za+  — ►  Za—  directly  preceding  an  output  transition  Lr+.  However,  this 
case  has  no  hazard  because  after  transition  Za+  occurs,  transitions  Za— 
and  Dr—  will  occur  concurrently.  The  occurrence  of  Dr—  changes  the  state 
of  the  logic  clement  before  retxirns  to  low.  Its  state  graph  (Fig.  9e) 
shows  that  output  transition  Lr+  is  enabled  in  the  unique  stateOOOO,  and 
there  is  no  output-conflict  in  this  case. 

Thus,  the  hazard  at  node  D,  is  caused  by  the  lack  of  internal  state 
information  to  discern  two  instances  of  state  001.  This  problem  shows 


up  in  the  reduced  graph  Gn(i^r)  ua  a  pair  of  consecutive  transitions  of 
the  same  node  Zf  In  order  to  remove  this  problem,  an  extra  transi* 
tion  such  as  i+  is  inserted  between  these  two  transitions.  Now,  tran¬ 
sition  X—  must  be  inserted  into  the  graph  to  preserve  its  liveness.  The 
reduced  graph  Gn(D,)  shows  that  i—  cannot  be  inserted  (i)  between  the 
pair  (£?r+,2,-i-)  or  —  because  this  only  produces  the  same  prob¬ 

lem  but  with  two  pairs  of  consecutive  transitions  of  the  same  nodes;  (ii) 
into  the  path  (D,+,  £a  — ,  Dr  —  )  because  i+,x—  become  concurrent  and  this 
would  violate  both  the  livencss  and  persistency  conditions.  Thus  x—  has 
to  be  inserted  into  the  path  (Z?,— ,  Ira+,  i?r+)-  Considering  the  original 
STG  of  Fig.  4a,  this  means  that  x—  must  be  inserted  into  the  path  contain 
transitions  (D,— ,D„— ,L,+,L„-i-,D,+).  Furthermore,  transitions  of  input 
nodes  Da,  La  and  Za  can  have  only  one  incident  arcs  coming  from  tran¬ 
sitions  of  their  corresponding  request  signals  Dr,Lr  and  Zr,  respectively. 
Thus  X—  can  be  inserted  between  (Pa— ,£.,+)  as  shown  in  Fig.  4b.  In 
this  final  well-formed  specification,  transition  i+  does  not  directly  precede 
transitions  of  node  D,;  however,  as  explained  earlier,  it  is  used  as  an  input 
to  logic  element  P,  to  eliminate  hazards  at  node  P,. 

Finally,  note  that  transition  i—  can  also  be  inserted  between  {La+,  Pr+) 
in  the  STG  in  Fig.  4a.  This  results  in  another  well-formed  graph,  from 
which  a  different  implementation  can  be  obtained  using  the  synthesis  steps 
described  above.  This  fact  indicates  that  the  implementation  is  sensitive  to 
the  particular  form  of  the  STG.  This  is  understandable  as  the  state  graphs 
extracted  from  STGs  are  unique  state-based  representations  of  the  behavior 
of  a  circuit. 
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(a)  Reduced  graph /or  Dr 


(b)  Reduced  graph/or  Lr 


(c)  Reduced  graph  for  7.r 


(d)  State  graph  of  Gr(Dr) 


Figure  9:  (a)-(c)  Reduced  graphs  of  nodes  Dr,Lr  and  Zr  derived  from  the 
STG  in  Fig.  4a.  (d)  State  graph  of  GaiDr).  (e)  State  graph  of  G/i(Lr). 
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Abstract:  Analog-to-digital  converters  that  must  produce  a  valid  output  in  a  speci¬ 
fied  period  of  time  are  subject  to  synchronizer  failure.  Three  types  of  A/D  converters  are 
cxaiuiiied:  flash  converters,  clocked  successive  appruxiiualion  converters,  .and  self-timed 
successive  .approximation  converters.  Lower  bounds  on  their  worst-case  conversion  time  as 
a  function  of  the  fault  probability  .are  derived. 

1.  Introduction 

The  finite  gain- bandwidth  product  of  the  comparators  used  in  analog- to-digital  (A/D) 
converters  gives  rise  to  fundamental  limits  on  the  speed  of  various  converter  architectures. 
While  the  average  delay  of  .m  A/D  converter  can  be  made  quite  fast,  the  worst-case 
conversion  time  of  any  A/D  converter  is  unbounded  [l].  This  is  because  there  is  always 
some  probability  that,  if  the  output  of  an  A/D  converter  is  used  in  a  synchronous  digital 
computer  system,  a  fatal  synchronizer  fault  can  occur.  Our  objective  is  to  express. a  lower 
bound  on  the  “worst-case”  A/D  conversion  time  as  a  function  of  this  fault  probability. 

The  synchronizer  problem  in  digital  systems  has  been  stitdied  for  many  years  [2-14]. 
Briefly  stated,  the  problem  occurs  when  an  asynchronous  signal  is  gated  into  a  clocked 
system.  The  clocked  system  must  decide,  in  a  specified  period  of  time,  the  state  of  the 
asynchronous  input.  The  circuit  that  must  make  this  decision  is  called  a  synchronizer. 

The  synchronizer  problem  occurs  because  the  synchronizer  circuit  can  take  ein  arbi¬ 
trarily  long  time  to  decide,  in  marginal  cases,  whether  the  input  is  above  or  below  a  given 
reference  standard.  In  most  physical  circuits  this  phenomenon  can  be  attributed  to  the 
gain-bandwidth  tradeoff.  Since,  at  any  given  time,  the  input  can  be  arbitrarily  close  to 
the  comparator  reference,  the  amplification  needed  to  turn  this  difference  into  a  full  logic 
swing  can  be  arbitrarily  large.  For  a  circuit  with  a  fixed  gain-b.andwidth  product,  the  delay 
incurred  in  achieving  this  enormous  amplification  can  grow  arbitrarily  large.  If  the  circuit 
is  required  to  produce  a  valid  output  by  a  certain  time,  then  there  is  a  finite  probability 
P  that  the  output  will  be  invalid  (not  a  logic  one  or  a  logic  zero)  at  that  time.  P  de¬ 
creases  exponentially  with  the  time  that  synchronizer  is  given  to  make  its  decision.  Many 
ingenious  tccliniques,  such  as  adding  hysteresis  or  noise,  have  been  tried  to  circumvent  the 
synchronizer  problem,  but  the  dilemma  appears  fundamental.  The  standard  solution  has 
become  to  jtist  wait  a  sufficient  period  of  time  that  P  is  acccpt.ably  small. 

*  This  research  w.as  supported  by  the  Defense  Adv.anced  Research  Projects  Agency  under  contract 
number  N000l4-80-C-0G'22. 


If  one  could  build  ;iu  A/D  couvcrlcr  that  could  be  guaranteed  to  always  produce  a 
v:did  digital  output  in  a  bounded  aniount  of  time,  then  one  could  use  this  converter  to 
build  a  fault-free  synchroni'/cr.  Iiuleed,  the  typical  synchroni/,er  is  nothing  more  than  a 
one  bit  A/I)  converter.  This  is  rc;ison  to  suspect  that  tlic  perfect  A/D  converter  docs 
not  exist.  Tognoni  has  examined  this  problem  experimentally  and  observed  mctastable 
problems  in  a  commercial  A/D  converter  [l5j.  In  the  remainder  of  this  paper  we  will 
examine  synchronizer  problems  in  A/D  converters  theoretically,  deriving  lower  bounds  on 
the  latency  of  the  A/D  conversion  as  a  function  of  the  reipiircd  reliability. 

2.  The  Model 

There  arc  many  circuit  modules  in  a  typical  A/D  converter  but  all  A/D  converters  make 
u.sc  of  one  or  more  comparators.  We  focus  our  attention,  in  this  paper,  on  the  speed  of 
the  comparator. 

There  are  basically  two  types  of  comparators.  The  simplest  takes  the  form  of  a  high 
gain  amplifier;  the  output  voltage  of  which  limits  at  zero  and  the  positive  power  supply 
voltage.  We  normalize  the  power  supply  voltage  to  1.  If  the  gain  of  the  high  gain  amplifier  is 
A,  then  the  output  of  the  comparator  is  undefined  (not  a  logic  0  or  1)  if  |uiu~Wrc-fl  <  1/(2A). 
Uju  is  the  input  voltage  and  is  the  comparator  reference  voltage.  Thus,  there  is  a  finite 
input  voltage  range  over  which  the  output  takes  an  infinitely  long  time  to  settle  and  if 
we  compute  the  average  comparator  response  time  over  all  input  voltages  (all  voltages  are 
assumed  equally  likely)  then  the  average  response  time  diverges  to  infinity.  The  delay  of 
a  well-designed  mxiltistage  amplifier  takes  the  form  Tioiay  =  I'ln  A,  where  r  is  a  constajnt 
of  the  technology.  (A  simple  way  to  view  this  is  to  cascade  two  high-gain  amplifiers.  The 
gains  multiply  and  the  delays  add.) 

A  bistable  (or  regenerative)  comparator  circuit  hcis  better  average  delay  properties. 
Figure  1  illustrates  a  simple  bistable  CMOS  comparator  circuit  clocked  on  (p.  The  delay 
T.iciay  of  this  circuit  is  roughly 

Tielay  —  ^hl|v;a  Vref|(  (l) 

where  v-m  and  Uref  are  normalized  to  the  power  supply  voltage  eind  r  is  roughly  equal  to  the 
inverse  of  the  gain-bandwidth  product  for  the  individual  inverters  (transistors  Mi  through 
M4).  We  normalize  the  delay  to  r.  Rewriting  (l)  in  nonnalized  form,  we  have 

T.Jclay  =  —  In  |A|,  (2) 

where  we  have  defined 

A  =  Win  -  Wref-  (3) 

The  delay  of  this  amplifier  diverges  only  logarithmically  with  A  and  therefore  has  a  finite 
average  delay.  We  will  confine  our  sUidy  to  this  second  type  of  comparator. 

The  average  speed  of  the  comparator  is  a  function  of  the  distance  between  adjacent 
bits.  For  an  N  bit  converter,  the  (nonnalized)  voltage  dilference  vi,it  between  adjacent 
bits  is 


Tlic  l>in:iry  fractions  raiiRo  from  0  to  I  - 

It  is  cotivuiiiciil  to  partition  tlic  input  voltage  into  a  digital  part  pins  an  analog  residue. 
VVe  re-express  Ui,i  Win  ••  ^  I’  w  where  -ui.u  <  <  ui.ji.  and  V  is  the  ideal  voltage 

representation  of  an  exact  /V-hit  binary  fraction,  such  as  the  foiir-bit  nnnibers  0.101  l-j  or 
0.0001 2-  The  conversion  time  is  always  a  function  of  v  and  may,  or  may  not,  be  a  function 
of  V',  depending  on  the  converter  arcliitecture. 

In  the  remainder  of  tliis  paper  we  will  Investigate  two  tyjics  of  delay.  The  Jiverage 
delay  is  deCned  as  the  average  of  the  conversion  delays  which  would  be  observed 
if  one  averaged  the  times  required  to  perform  a  largo  number  of  conversions,  where  we 
have  assumed  that  all  mput  voltages  Ui,,  arc  equally,  likely.  Wc  define  the  worst-case  delay 
Ty,„T»t-v.w  in  terms  of  the  fault  probability  P.  If  wc  require  P  =  0  then  oo- 

What  wc  mean  by  the  worst-case  conversion  time  is  that  if  we  observe  a  very  large  number 
of  conversions  M  of  randomly  selected  input  voltages  (uniform  distribution)  and  we  throw 
out  the  slowest  PM  of  these  conversions  as  causing  faults,  then  the  worst-case  conversion 
time  Tw,,r.st-,iu«*  is  the  slowest  conversion  time  in  the  remaining  set  of  (1  —  P)M  better 
conversions. 

3.  Flash  Converter 

For  the  flash  converter,  there  exists  some  comparator  for  which  the  input  voltage  is  within 
±Wbit/2  of  the  reference  voltage.  For  this  comparator  we  may  write  A  =  u.  Assume  that 
all  valiies  of  v  are  equally  likely  over  the  interval  (—ui, it/2,  Vbit/2].  The  average  value  of 
T.jcUy  13  given  by 


Solving,  we  obtain 

Thvk  =  1  +  (iV  -f- 1)  In  2  (6) 

for  the  flash  converter.  The  minimum  average  delay  increases  linearly  with  N  for  a  given 
technology. 

If  V  =  0,  Tjciay  — *  oo-  Given  a  specified  value  of  Tliday,  there  is  some  range  of  v  over 
which  the  comparator  output  fails  to  settle  in  time.  When  this  happens  we  say  that  we 
have  a  fault.  Let  us  denote  the  biterval  of  fault  causing  voltages  by  —5<2v<  5.  The 
probability  of  a  fault  P  is  given  by 


P=^/V\nt’  (7) 

We  can  bound  the  worst-case  delay  Tworst-caae  to  be 

Tworst— caac 

The  interpretation  of  (8)  is  that  if  we  perform  conversions  of  M  independent  input  voltages, 
we  expect  Ti.-iay  to  be  greater  than  Tworat-ciiac  for  »it  least  PM  conversions.  Solving  (8), 
we  obtain  ■ 

T 

worst  — Tiutc 


>  Wln2-ln>P. 


(9) 


The  vjilnc  of  P  for  wliicli  ~  Tavu  is  l/‘2e.  Note  UuiL  Ihc  dopctultMicc  of  (.he 

delay  on  N  bccoiiies  rcialivcly  less  iinportaiiL  :is  P  is  made  smaller.*  I''i(;ure  2  illusLrales 
(9)  for  three  dilferciit  values  of  P . 

4.  Clocked  Successive  Approximation  Converter 

Kor  a  clocked  successive  approximation  A/D  converter,  the  average  latency  is  simply  N 
times  the  clock  period.  The  length  of  the  clock  period  depends  on  the  fault  probability 
one  is  willing  to  accept.  The  probability  of  fault  in  a  noiseless  succc.ssive  approximation 
converter  is  simply  the  probability  of  not  successfully  completing  a  conversion  on  any  one 
clock  cycle.  Note  that  this  probability  is,  because  the  system  was  assumed  ideal  and 
noiseless,  just  the  probability  of  faulting  on  the  IVth  conversion.  This  was  derived  for  the 
flash  converter.  Since  there  are  N  conversions,  we  have 

Tw„r«t-r.u«.  >  iV2ln2 -iVln>f.  (10) 

Equation  (10)  is  plotted  in  Fig.  3. 

5.  Seif-timed  Successive  Approximation  Converter 

We  can  observe  from  (1)  that  the  comparator  delay  is  a  hinction  of  the  initial  voltage 
dilTerence.  As  pointed  out  by  H.-S.  Lee  [16],  not  all  conversion  steps  in  a  successive  ap¬ 
proximation  converter  can  be  within  ±t;bu/2  of  the  reference  voltage.  This  means  that 
some  conversions  will  be  fast  and  some  slow.  Figure  4  illustrates  the  block  diagraiu  of  a 
self-timed  successive  approximation  converter.  The  asynchronous  logic  for  this  figure  was 
designed  by  Mr.  Tam-Anh  Chu  and  is  documented  in  [17|.  The  comparator  is  balanced  at 
the  beginning  of  each  conversion.  The  end  of  each  conversion  is  sensed  by  a  mutual  exclu¬ 
sion  circuit  [18] .  (Because  our  analysis  ignores  the  significant  overhead  time  of  balancing 
the  amplifier,  sensing  the  completion,  and  changing  Uref)  these  bounds  are  not  as  tight  iis 
the  ones  in  the  previous  section.  Thus  comparisons  of  seif-timed  and  clocked  successive 
approximation  converters  should  be  done  with  care.) 

On  each  conversion  k,  a  binary  number  Bk  is  tested.  Let  Dk  =  Bk  —  V  he  the 
difference  between  the  present  test  value  Bk  and  the  final  binary  voltage.  Dk  can  be  either 
positive  or  negative  and  is  bounded  by  [Ditl  <  2”*',  where  1  <  k  <  N .  The  delay  Tk  of 
the  fcth  conversion  (considering  only  the  comparator  delay  contribution)  is  given  by 

Tfc  = -In  |£>fc  -  v|.  (11) 

Summing  over  all  N  conversions, 

N 

T.ioUy  =  -  ^  In  [Dfe  -  u|.  (12) 

fc=i 

*  While  the  average  performance  of  the  amplifier  type  comparator  is  poor,  the  worst- 
case  performance  is  similar  to  that  of  the  bistable  comparator.  This  is  because,  for  the 
same  level  of  reliability,  they  both  are  reqiiircd  to  provide  amplification  on  the  order  of 
l/(>'’vuit). 


The  avcniKo  conversion  time  7’ivk  foniui  by  averaging  (12)  over  all  v  and  all  V .  The 
averaging  over  v  can  be  done  in  closed  form.  Given  a  particular  K, 


N 


T„^{N,V)  =  Y,Tot, 


(13) 


fc=l 


where  we  have  dcHned  the  conversion  time  for  one  bit  Tpk  ^ 


Tofc  = - /  In  iDfc -- u|dt;. 

Vliit  J 


(14) 


Solving(14),  we  obtain 


Tok 


J '  -  { (l^h  0 ^  x)  -  (II^N) 


Dk  is  a  function  of  V  and  N.  Averaging  (13)  over  all  V  we  obtain 


1  -  In  for  Dk  =  0 

(15) 


r.v.(iV)  =  2-''‘£ 

V=0  fc=l 


(16) 


The  worst-czise  delay  is  given  by 


AT 


^worst— ca3e(^)  ^  maX  ^  ^  In  Vi}itP|. 

0<V<1 


(17) 


fc=l 


A  program  was  written  to  preform  exhaustive  tonalysis,  for  all  V  from  fV  =  1  to  14,  on 
(16)  and  (17).  For  (17),  P  was  set  to  0.1,  10“®,  zuid  10“'^.  Figure  5  illustrates  the  worst- 
case  and  average  behavior  of  the  self-timed  successive  approximation  converter.  Table  1 
gives  the  numbers  V  which  caused  the  worst-case  performance  in  (17).  In  many  cases 
there  me  several  numbers  which  cause  equivalently  slow  behavior.  The  smallest  of  these 
is  listed.  Empirically,  numbers  of  the  fonn  ...010101  always  caused  worst-case  behavior. 
For  N  in  the  region  14-20  and  P  in  the  region  of  10~®  to  10“^^,  the  lower  bound  on  the 
self-timed  successive  approximation  converter  speed  was  about  five  times  slower  than  a 
flash  converter  constructed  in  the  same  technology. 


Table  1.  Worst-case  V  versus  N 


N 

V  (biisc  2) 

2 

0.0 1 

3 

0.0 11 

4 

0.0101 

5 

o.oiioi  • 

6 

O.OIOIOl 

7 

0.0110101 

8 

O.OIOIOIOI 

9 

O.OIIOIOIOI  • 

10 

O.OIOIOIOIOI 

11 

O.OIIOIOIOIOI 

12 

O.OIOlOiOIOlOl 

13 

O.OIIOIOIOIOIOI 

14 

O.OIOIOIOIOIOIOI 

6.  Conclusions 

Tkree  types  of  converters  were  studied  with  respect  to  average  and  worst-case  delay.  In 
all  cases,  the  worst-case  latency  was  shown  to  increase  with  increasing  reliability.  Note 
that  while  latency  is  a  function  of  P,  the  throughput  need  not  decrease  with  decreasing 
P,  providing  one  is  willing  to  add  extra  pipeline  hardware  to  give  the  synchronizers  tinae 
to  settle. 

We  can  divide  A/D  applications  into  two  classes.  Those  in  which  latency  is  important 
(such  as  industrial  control  applications  in  which  the  A/D  is  in  a  feedback  path  with,  say, 
a  microprocessor)  and  those  in  which  throughput  is  important  (such  as  for  image  process¬ 
ing).  For  the  first  class  of  applications,  one  can  anticipate  that  catastrophe  avoidance  will 
mandate  high  reliability  requirements.  This  implies  a  small  P ,  in  conflict  with  system  sta¬ 
bility  requirements  for  low  latency.  For  pipelined  signal  processing  applications,  additional 
pipeline  stages  can  easily  be  added  to  reduce  P  to  an  acceptable  level. 

One  of  the  more  intriguing  results  of  this  study  is  that  the  self-timed  successive- 
approximation  converter  begins  to  become  competitive,  in  terms  of  speed,  with  flash 
converters  for  very  low  P  and  large  N.  This  is  because,  for  low  P,  both  the  flash  and 
self- timed  successive-approximation  converters  spend  most  of  their  time  working  on  the 
one  hard  (u  >C  Ubit)  bit  that  must  be  assumed  to  be  there. 
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Figures 

1  CMOS  regenerative  comparator  zeroed  when  4>  is  liigh  and  evaluated  when  0  is  low. 
The  pass  devices  are  contain  both  p-  and  n-channel  transistors. 

2  Graph  of  lower  bounds  on  the  normalized  delay  of  a  Ihish  A/D  converter.  In  the  figure 

TiivRj  and  Tworst-ciuio  for  three  different  values  of  P. 

3  Graph  of  lower  bounds  on  the  worst-case  behavior  of  a  conventional  successive  ap¬ 
proximation  A/D  converter  for  three  difTcrent  values  of  P. 

4  Block  diagram  of  a  self-timed  successive  approximation  converter.  The  conversion  is 
started  when  Req  goes  high.  The  controller  signals  completion  by  raising  Ack.  ME  is 
a  mutual  exclusion  circuit,  DAC  is  a  D/A  converter,  C/L  is  combinational  logic,  and 
the  delay  line  is  calibrated  to  mimic  the  worst-case  path  through  the  combinational 
logic  and  DAC. 

5  Normalized  lower  bounds  on  the  delay  through  a  self-timed  successive  approximation 
A/D  converter. 
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Fat-Trees;  Universal  Networks  for  Hardware-Efficient 

Supercomputing 
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Abstract  — This  paper  presents  a  new  class  of  universal  routing 
networks  called  fat-trees,  which  might  be  used  to  interconnect  the 
processors  of  a  general-purpose  parallel  supercomputer.  A  fat- 
tree  routing  network  is  parameterized  not  only  in  the  number  of 
processors,  but  also  in  the  amount  of  simultaneous  commu¬ 
nication  it  can  support.  Since  communication  can  be  scaled  inde¬ 
pendently  from  number  of  processors,  substantial  hardware  can 
be  saved  over,  for  example,  hypercube-based  networks,  for  such 
parallel  processing  applications  as  finite-element  analysis,  but 
without  resorting  to  a  special-purpose  architecture. 

Of  greater  interest  from  a  theoretical  standpoint,  however,  is  a 
proof  that  a  fat-tree  of  a  given  size  is  nearly  the  best  routing 
network  of  that  size.  This  universality  theorem  is  proved  using  a 
three-dimensional  VLSI  model  that  incorporates  wiring  as  a  di¬ 
rect  cost.  In  this  model,  hardware  size  is  measured  as  physical 
volume.  We  prove  that  for  any  given  amount  of  communications 
hardware,  a  fat-tree  buUt  from  that  amount  of  hardware  can 
simulate  every  other  network  built  from  the  same  amount  of 
hardware,  using  only  slightly  more  time  (a  polylogarithmic  factor 
greater).  The  basic  assumption  we  make  of  competing  networks  is 
the  following.  In  unit  time,  at  most  0(a)  bits  can  enter  or  leave  a 
closed  three-dimensional  region  with  surface  area  a.  (This  paper 
proves  the  universality  result  for  off-line  simulations  only.) 

Index  Terms  —  Fat-trees,  interconnection  networks,  parallel  su¬ 
percomputing,  routing  networks,  universality,  VLSI  theory. 

I.  Introduction 

Most  routing  networks  for  parallel  processing  super¬ 
computers  have  been  analyzed  in  terms  of  per¬ 
formance  and  cost.  Performance  is  typically  measured  by 
how  long  it  takes  to  route  permutations,  and  cost  is  measured 
by  the  number  of  switching  components  and  wires.  This 
paper  presents  a  new  routing  network  called  fat-trees,  but 
analyzes  it  in  a  somewhat  different  model.  Specifically,  we 
use  a  three-dimensional  VLSI  model  in  which  pin  bounded¬ 
ness  has  a  direct  analog  as  the  bandwidth  limitation  imposed 
by  the  surface  of  a  closed  three-dimensional  region.  Per¬ 
formance  is  measured  by  how  long  it  takes  to  route  an  arbi¬ 
trary  set  of  messages,  and  cost  is  measured  as  the  volume  of 
a  physical  implementation  of  the  network.  We  prove  a  uni¬ 
versality  theorem  which  shows  that  for  a  given  volume  of 
hardware,  no  network  is  much  better. 

Unlike  a  computer  scientist's  traditional  notion  of  a  tret, 
fat-trees  are  more  like  real  trees  in  that  they  get  thicker 
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further  from  the  leaves.  In  physical  structure,  a  fat-tree 
resembles,  and  is  based  on,  the  tree  of  meshes  graph  due  to 
Leighton  [12],  [14],  The  processors  of  a  fat-tree  are  located 
at  the  leaves  of  a  complete  binary  tree,  and  the  internal  nodes 
are  switches.  Going  up  the  fat-tree,  the  number  of  wires 
connecting  a  node  with  its  father  increases,  and  hence  the 
communication  bandwidth  increases.  The  rate  of  growth  in¬ 
fluences  the  size  and  cost  of  the  hardware  as  well. 

Most  networks  that  have  been  proposed  for  parallel  pro¬ 
cessing  are  based  on  the  Boolean  hypercube,  but  these  net¬ 
works  suffer  from  wirability  and  packaging  problems  and 
require  nearly  order  physical  volume  to  interconnect  n 
processors.  In  his  influential  paper  on  “ultracomputers"  [27], 
Schwartz  demonstrates  that  many  problems  can  be  solved 
efficiently  on  a  supercomputer-based  on  a  shuffle  network 
[28],  But  afterwards,  Schwartz  comments,  “The  most  prob¬ 
lematic  aspect  of  the  ultracomputer  architecture  suggested  in 
the  preceding  section  would  appear  to  be  the  very  large  num¬ 
ber  of  intercabinet  wires  which  it  implies.”  Schwartz  then 
goes  on  to  consider  a  “layered”  architecture,  which  seems 
easier  to  build,  but  which  may  not  have  all  the  nice  properties 
of  the  original  architecture. 

On  the  other  hand,  there  are  many  applications  that  do  not 
require  the  full  communication  potential  of  a  hypercube- 
based  network.  For  example,  many  finite-element  problems 
are  planar,  and  planar  graphs  have  a  bisection  width  of  size 
0(Vn),  as  was  shown  by  Lipton  and  Tarjan  [19],  Moreover, 
any  planar  interconnection  strategy  requires  only  0{n)  vol¬ 
ume.  Thus,  a  natural  implementation  of  a  parallel  finite- 
element  algorithm  would  waste  much  of  the  communication 
bandwidth  provided  by  a  hypercube-based  routing  network. 

Fat-trees  are  a  family  of  general-purpose  interconnection 
strategies  which  effectively  utilize  any  given  amount  of  hard¬ 
ware  resource  devoted  to  communication.  This  paper  proves 
that  for  a  given  physical  volume  of  hardware,  no  network  is 
much  better  than  a  fat-tree.  Section  II  introduces  fat-tree 
architectures  and  gives  the  logical  structure  of  one  feasible 
implementation.  Section  III  shows  how  communication  on  a 
fat-tree  can  be  scheduled  off-line  in  a  near-optimal  fashion. 
Section  IV  defines  the  class  of  universal  fat-trees  and  in¬ 
vestigates  their  hardware  cost  in  a  three-dimensional  VLSI 
model.  Section  V  contains  several  combinatorial  theorems 
concerning  the  recursive  decomposition  of  an  arbitrary  rout¬ 
ing  network,  and  Section  VI  uses  these  results  to  demonstrate 
that  fat-trees  are  indeed  a  class  of  hardware-efficient  univer¬ 
sal  routing  networks.  Finally,  Section  VII  offers  some  re¬ 
marks  about  the  practicality  of  fat-trees. 
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11.  Fat-Trees 

This  section  introduces  fat-trees  as  a  routing  network  for 
parallel  computation.  The  parallel  computer  based  on  fat- 
trees  that  we  present  is  somewhat  arbitrary  and  is  influenced 
by  the  various  connection  machine  projects  [5] ,  [9] ,  [  1 1  ]  con¬ 
ceived  at  M.l.T.  The  computational  model  is  not  meant  to  be 
exclusive  —  the  results  in  this  paper  undoubtedly  apply  to 
more  general  models.  Moreover,  arbitrary  “engineering  de¬ 
sign  decisions,"  which  may  not  be  the  best  choices  from 
either  a  practical  or  a  theoretical  perspective,  have  been  made 
in  this  description  of  fat-trees.  Most  of  the  choices  influence 
the  results  by  only  a  logarithmic  factor,  however,  and  do  not 
affect  the  overall  thrust  of  the  paper — the  universality  theo¬ 
rem  in  Section  VI. 

The  intuitive  model  for  parallel  computation  that  we  use  is 
a  parallel  computation  engine  composed  of  a  set  of  processors 
interconnected  by  a  routing  network.  The  processors  share 
no  common  memory,  and  thus  they  must  communicate 
through  the  routing  network,  using  messages.  The  Job  of  the 
routing  network  is  to  see  that  all  messages  eventually  reach 
their  destinations  as  quickly  as  possible. 

A  fat-tree  FT  is  a  routing  network  based  on  a  complete 
binary  tree.  (See  Fig.  1.)  A  set  F  of  n  processors  is  located 
at  the  leaves  of  the  fat-tree.  Each  edge  of  the  underlying  tree 
corresponds  to  two  channels  of  the  fat-tree:  one  from  parent 
to  child,  the  other  from  child  to  parent.  Each  channel  consists 
of  a  bundle  of  wires,  and  the  number  of  wires  in  a  channel  c 
is  called  its  capacity,  denoted  by  cap(c).  The  capacities  of 
channels  in  the  routing  network  are  determined  by  how  much 
hardware  we  can  afford,  a  topic  to  be  discussed  in  Section  IV. 
The  channel  leaving  the  root  of  the  tree  corresponds  to  an 
interface  with  the  external  world.  Each  (internal)  node  of  the 
fat-tree  contains  circuitry  that  switches  messages  between 
incoming  channels  and  outgoing  channels. 

Messages  produced  by  processors  are  batched  into  deliv¬ 
ery  cycles.  During  a  delivery  cycle,  a  processor  may  send 
messages  through  the  network  to  other  processors.  Some 
messages  may  be  lost  in  the  routing  network  during  a  delivery 
cycle.  Thus,  in  general,  at  the  end  of  the  delivery  cycle, 
acknowledgments  are  sent  from  the  destination  processor 
back  to  the  source  processor.  Messages  that  are  not  delivered 
must  be  sent  again  in  subsequent  delivery  cycles. 

The  nodes  of  the  fat-tree  accomplish  most  of  the  switching. 
In  order  to  understand  their  function,  one  must  first  under¬ 
stand  how  the  routing  of  messages  is  accomplished.  A  mes¬ 
sage  set  M  C  F  X  F  is  a  set  of  fftersflger  (/,;).  If(r,y)  G  M, 
then  processor  i  has  a  message  to  be  sent  to  processor  j.  (We 
omit  details  concerning  the  contents  of  messages  and  the 
handling  of  messages  routed  to  and  from  the  external  inter¬ 
face.)  Routing  in  the  fat-tree  is  basically  easy  since  every 
message  has  a  unique  path  in  the  underlying  complete  binary 
tree.  A  message  going  from  processor  /  to  processor  j  goes 
up  the  tree  to  their  least  common  ancestor  and  then  back  down 
according  to  the  least  significant  bits  of  j.  Notice  that  at  any 
node  of  the  fat-tree,  there  are  only  two  choices  for  the  routing 
of  a  message.  If  it  comes  into  a  node  from  a  left  subtree,  for 
example,  it  can  only  go  up  or  down  to  the  right.  Thus,  a  bit 


Fig.  I .  The  organization  of  a  fat-tree.  Processors  are  located  at  the  leaves,  and 
the  internal  nodes  contain  concentrator  switches.  The  capacities  of  channels 
increase  as  we  go  up  the  tree. 

String  of  length  at  most  2  Ig  n  is  sufficient  to  represent  the 
destination  of  any  message.' 

We  shall  consider  communication  through  the  fat-tree  net¬ 
work  to  be  synchronous  and  bit  serial.  Messages  snake 
through  the  tree  with  leading  bits  of  a  message  establishing 
a  path  for  the  remainder  to  follow.  Since  some  of  the  paths 
through  the  tree  are  longer  than  others,  synchronization  of  the 
departures  and  arrivals  of  messages  can  be  a  bit  tricky.  Buff¬ 
ering  of  messages  by  the  sending  processors  is  one  solution 
to  this  problem.  (As  was  mentioned  before,  there  are  many 
other  engineering  alternatives  that  lead  to  the  same  kinds  of 
theoretical  results  reported  here.)  The  differing  lengths  of 
paths  in  the  fat-tree  are  actually  a  major  advantage  of  the 
network  because  messages  can  be  routed  locally  without 
soaking  up  the  precious  bandwidth  higher  up  in  the  tree, 
much  as  telephone  communications  are  routed  within  an  ex¬ 
change  without  using  more  expensive  trunk  lines. 

The  messages  in  the  network  obey  the  bit-serial  protocol 
shown  in  Fig.  2.  The  first  bit  is  the  M  bit,  which  tells  whether 
the  remaining  bits  actually  contain  a  message.  Next  come  the 
address  bits,  which  name  the  destination  processor.  The  final 
field  in  the  message  format  is  the  data  themselves.  As  mes¬ 
sages  are  routed  through  the  network,  each  node  uses  the  M 
bit  to  identify  whether  a  wire  carries  a  message,  and  it  uses 
the  first  address  bit  to  make  a  routing  decision.  A  path  is 
established  through  the  node  for  a  new  M  bit  and  the  remain¬ 
ing  message  bits  to  follow.  The  address  bits  are  stripped  off 
one  by  one  as  the  message  establishes  a  path  through  the 
network. 

A  fat-tree  node  has  three  input  ports,  (//,  L,,  and  R,,  and 
three  output  ports,  Uq,  Lq,  and  Fq,  connected  in  the  natural 
way  to  the  wires  in  the  channels.  Messages  entering  input 
port  Lf  will  go  either  to  output  port  Uo  or  to  output  port  Rq. 
The  logic  of  the  switching  circuitry  in  a  node  consists  of  three 
similar  portions,  shown  in  Fig.  3.  A  wire  from  an  input  port 
is  fanned  out  towards  the  two  opposite  output  ports.  The  M 
bit  of  each  wire  is  then  examined  to  determine  whether  the 
wire  has  a  message.  On  the  next  clock  tick,  the  first  address 
bit  is  examined  on  both  branches  of  the  input  wire.  By  ANOing 

'We  use  the  notation  Ig  n  to  mean  ma.x(l .  log;  n}. 
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Fig.  2.  The  format  of  bit-serial  messages.  The  first  bit  that  a  switch  sees  is  the 
M  bit,  which  indicates  whether  an  input  wire  actually  contains  a  message.  The 
address  bits  arrive  bit-serially  in  subsequent  time  steps,  and  the  message 
contents  are  last. 


Uo  Ui 


Fig.  3.  The  internal  structure  of  a  fat-tree  node.  A  selector  determines  which 
messages  are  destined  for  an  output  pon.  and  then  a  concentrator  switch 
establishes  disjoint  electrical  paths  for  as  many  of  those  messages  as  possible. 


the  M  bit  with  either  the  address  bit  or  its  complement,  an 
M  bit  is  determined  for  each  branch  by  a  selector.  Next,  the 
messages  destined  for  an  output  port,  which  currently  occupy 
many  wires,  are  switched  onto  fewer  wires  by  a  concentrator 
switch. 

The  job  of  the  concentrator  switch  is  to  create  electrical 
paths  from  those  input  wires  that  carry  messages  to  fewer 
output  wires.  Obviously,  if  there  are  more  input  messages 
than  output  wires,  some  messages  will  be  lost.  In  this  case  we 
shall  say  that  the  output  channel  is  congested.  We  have 
already  mentioned  an  acknowledgment  mechanism  that  de¬ 
tects  when  messages  are  lost  due  to  congestion. 

For  the  time  being,  we  shall  assume  that  the  concentrator 
switch  has  the  following  property.  If  there  is  no  congestion  — 
that  is,  the  number  of  input  messages  does  not  exceed  the 
number  of  output  wires  —  then  no  messages  are  lost.  The 
concentrator  switch  that  we  shall  present  in  Section  IV  is  a 
partial  concentrator  and  does  not  have  exactly  this  property, 
but  it  makes  little  difference  to  the  theoretical  results.  This 
circuit  has  0{m)  components  if  there  are  a  total  of  m  incident 
wires,  and  it  switches  in  constant  time.  Thus,  the  time  re¬ 
quired  for  an  entire  delivery  cycle  in  a  fat-tree  of  n  processors 
is  0(lg  n). 

Although  we  have  described  the  general  setting  as  an  on¬ 
line  switching  environment,  this  paper  makes  the  simplifying 
assumption  that  the  fat-tree  nodes  contain  off-line  circuitry, 
in  that  the  switches,  although  dynamically  set,  have  their 
settings  predetermined  by  an  off-line  scheduling  algorithm. 
Naturally,  it  would  be  better  to  dynamically  determine  the 
settings  themselves  in  real  time,  and  indeed,  it  is  possible  to 
build  such  on-line  switches,  but  these  results  will  be  reported 


There  are  several  consequences  of  the  off-line  assumption 
that  bear  mention,  however.  For  example,  the  results  apply  to 
practical  situations  when  the  settings  of  switches  can  be 
“compiled,"  as  when  simulating  a  large  VLSI  design  or  emu¬ 
lating  a  fixed-connection  network.  Also,  some  of  the 
mechanisms  —  such  as  acknowledging  the  receipt  of  mes¬ 
sages,  which  is  necessary  in  an  on-line  environment  —  can  be 
omitted  from  the  off-line  hardware  structure,  thereby  reduc¬ 
ing  the  complexity  of  the  design. 


III.  Off-Line  Scheduling  on  Fat-Trees 

The  concentrator  switches  in  the  nodes  of  a  fat-tree  routing 
network  guarantee  that  no  messages  are  lost  unless  there  is 
congestion.  This  section  gives  an  algorithm  for  scheduling 
the  delivery  of  an  arbitrary  set  of  messages  so  that  all  mes¬ 
sages  will  be  delivered.  We  give  a  simple  value,  called  the 
load  factor  of  a  set  of  messages,  which  provides  a  lower 
bound  on  how  quickly  the  messages  can  be  delivered.  We 
show  that  for  an  arbitrary  message  set,  off-line  scheduling 
can  be  done  optimally  to  within  a  logarithmic  factor  of  the 
number  of  processors. 

Let  us  be  more  precise  about  the  off-line  scheduling  prob¬ 
lem.  Let  FT  be  a  fat-tree  on  n  processors,  and  let  C  be  the  set 
of  channels  in  FT.  For  any  channel  c  £  C,  the  capacity 
cap(c),  which  is  the  number  of  wires  in  the  channel,  is  also 
the  maximum  number  of  simultaneous  messages  the  channel 
can  support  because  we  are  assuming  bit-serial  communica¬ 
tion.  Since  each  message  between  two  processors  determines 
a  unique  path  in  the  underlying  complete  binary  tree,  we  can 
define  load(A/,c)  to  be  the  total  number  of  messages  in  a 
message  set  M  that  must  go  through  channel  c.  We  call  M  a 
one-cycle  message  set  if  load(/Vf,  c)  s  cap(c)  for  all  channels 
c  G  C.  If  all  capacity  constraints  are  met,  a  fat-tree  with 
ideal  concentrator  switches  can  route  every  message  in  one 
delivery  cycle. 

A  schedule  of  a  message  set  M  is  a  partition  of  M  into 
one-cycle  message  sets  Mu  M2,  •  •  •  ,M^  where  d  is  the  total 
number  of  delivery  cycles.  A  simple  lower  bound  one/  for  an 
arbitrary  message  set  M  is  d  ^  max^floadfA/,  c)/cap(c-)), 
which  leads  to  the  following  definition. 

Definition:  Let  M  be  a  message  set,  and  let  c  G  C  be  a 
channel  in  a  fat-tree.  The  load  factor  \(M,c)  of  a  channel  c 
due  to  M  is 


A(M,c) 


load(A/,  c) 
cap(c) 


and  the  load  factor  of  the  entire  fat-tree  due  to  M  is 


X{M)  =  max  A(M,  c) . 
rec 


A  message  set  M  is  a  one-cycle  message  set  if  \{M)  ^  1. 

The  simple  lower  bound  on  the  number  d  of  delivery  cycles 
required  for  any  schedule  of  M  can  now  be  reexpressed  as 
d  2  \{M).  The  next  theorem  shows  that  this  lower  bound 


elsewhere  [8].  We  have  chosen  here  to  prove  the  weaker  can  be  achieved  to  within  a  logarithmic  factor  of  n. 
off-line  results  so  as  to  simplify  the  presentation  of  the  um-  Theorem  I :  Let  FT  be  a  fat-tree  on  n  processors,  and  let 
versality  theorem  in  Section  VI.  C  be  the  set  of  channels  in  FT.  Then  for  any  message  set  M 
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with  \(M)  >  ] ,  there  is  an  off-line  schedule  M,.  M2.  ■  •  ■  ,Mj 
such  that  d  =  0{\(M)  Ig  n). 

Proof:  The  idea  is  to  partition  the  messages  going  from 
left  to  right  through  the  root  of  the  fat-tree  into  at  most  lk(M) 
one-cycle  message  sets,  to  do  the  same  for  the  messages 
going  from  right  to  left,  and  then  to  recursively  partition  the 
messages  in  the  two  subtrees  of  the  root.  Let  Q\  be  the  subset 
of  M  consisting  of  those  messages  that  must  go  through  the 
root  from  left  to  right.  The  scheduling  algorithm  will  begin 
by  partitioning  Qy  into  two  message  sets  (2:  and  Qj-  It  then 
iteratively  refines  each  Q*  into  and  Qy+i,  until  each  Q*, 
k  -  r,  •••  ,lr  -  \  \%  3.  one-cycle  message  set  for  some 
r  <  2X(M).  The  r  message  sets  Q„  -  •  • ,  C>-i  form  the  initial 
sequence  of  the  schedule. 

The  algorithm  similarly  partitions  the  message  set  con¬ 
sisting  of  messages  going  from  right  to  left  in  the  fat-tree  and 
adds  them  to  the  schedule.  (Each  of  these  message  sets  can, 
in  fact,  be  routed  at  the  same  time  as  one  of  the  Q*.)  Finally, 
the  algorithm  recursively  partitions  the  messages  remaining 
within  the  two  subtrees  of  the  root.  The  upper  bound  of 
2k(M)  one-cycle  message  sets  holds  for  all  messages  routed 
through  the  root  of  a  subtree.  But  since  all  subtrees  with  roots 
at  the  same  level  can  be  routed  at  the  same  time,  the  total 
number  of  delivery  cycles  required  is  at  most  the  height  of 
the  fat-tree  times  the  time  for  one  level,  which  yields 
d  =  0{k(M)  Ig  n). 

It  remains  to  show  that  the  message  sets  can  be  partitioned 
effectively.  Consider  once  again  the  message  set  Qi  of  mes¬ 
sages  going  left  to  right  through  the  root  of  the  fat-tree.  We 
now  show  that  each  message  set  Q*,  it  =  1,2,  •  •  •  ,r  -  1, 
can  be  partitioned  into  Qai  and  Qn*!  so  that  for  every  channel 
c  E  C,  the  messages  of  Qt  that  go  through  c  are  split  exactly 
evenly,  that  is,  so  that  load(Qai  c)  —  [(1/2)  load(Qt,  c)l  and 
load(2jtti,c)  s  [(1/2)  load(Qj,c)l.  The  partitioning  con¬ 
sists  of  two  parts,  matching  and  tracing,  and  is  reminiscent  of 
switch  setting  in  a  Benes  network  [34]  and  the  Eulerian  tour 
routing  algorithm  from  [10]. 

First,  do  the  matching.  Consider  each  message  in  Q*  as 
being  a  string  with  two  ends;  a  source  end  and  a  destination 
end.  Within  each  processor,  match  as  many  pairs  of  string 
ends  as  possible  until  at  most  one  message  of  2*  is  unmatched 
within  each  processor.  Notice  that  source  ends  are  matched 
only  with  source  ends  and  destination  ends  only  with  desti¬ 
nation  ends  because  all  messages  in  go  left  to  right  through 
the  root.  Then  consider  two-leaf  subtrees.  If  each  of  the  two 
leaves  has  one  unmatched  string  end,  match  the  ends.  Con¬ 
tinue  matching  the  unmatched  string  ends  in  four-leaf  sub¬ 
trees,  and  so  on  up  the  fat-tree.  At  every  level  of  the  fat-tree, 
at  most  one  string  end  is  unmatched  in  each  of  the  two  sub¬ 
trees  of  a  node.  At  the  root,  at  most  one  string  end  from  each 
side  will  be  unmatched  (when  there  is  an  odd  number  of 
messages  going  from  left  to  right  through  the  root). 

Now  the  tracing  phase  begins.  If  there  is  an  unmatched 
string  end  in  the  left  subtree,  start  with  it.  Otherwise,  pick  a 
string  end  arbitrarily  from  the  left  subtree.  Put  the  corre¬ 
sponding  message  into  Q-j,,  and  follow  the  string  to  the  right 
subtree.  Find  the  mate  of  the  string  end  on  the  right  side,  and 
put  the  corresponding  message  into  Follow  this  new 
string  back  to  the  left  side,  find  its  mate,  and  put  the  corre¬ 


sponding  message  into  Qy.  In  general,  when  traversing  a 
string  left  to  right,  put  the  corresponding  message  into  Q2x- 
When  traversing  right  to  left,  put  the  message  into  If 
we  discover  that  a  string  end  has  no  mate,  or  that  the  message 
corresponding  to  the  mate  has  already  been  assigned,  we 
have  either  found  the  (one)  unmatched  string  end  on  the  right 
or  completed  a  cycle.  In  either  event,  pick  another  string  end 
arbitrarily  and  continue  until  all  messages  in  Q*  have  been 
assigned  either  to  Qa  or  to  Qn+i. 

To  see  that  this  algorithm  evenly  splits  the  messages  of  2* 
in  every  channel  c,  observe  that  the  number  of  times  we  enter 
a  subtree  of  the  fat-tree  is  equal  to  the  number  of  times  we 
leave,  unless  we  are  tracing  the  one  possible  string  end 
matched  outside  the  subtree.  Since  the  split  is  even  in  every 
channel,  the  partitioning  of  Qy  into  one-cycle  message  sets 
Qr.“‘,  Q2r-i  will  be  achieved  when  . 


r  s  2  max 

e 

^  2  max 


load(Qi,c) 

cap(c) 

Ioad(Af,  c) 
cap(c) 


s  2A(Af), 

which  completes  the  proof.  ■ 

For  the  special  case  when  cap(c)  >  a  Ig  n,  for  some 
a  >  1,  the  logarithmic  factor  in  the  upper  bound  of 
Theorem  1  can  be  removed.  Thus,  under  these  conditions, 
the  lower  bound  of  the  load  factor  can  be  met  almost  exactly. 

Corollary  2:  Let  FT  be  a  fat-tree  on  n  processors,  let  C  be 
the  set  of  channels  in  FT,  and  suppose  that  there  is  a  constant 
a  >  1  such  that  cap{ c)  ^  a  Ig  n  for  all  c  £  C.  Then  for  any 
message  set  M,  there  is  an  off-line  schedule  M,,  Mi,  ■  ■  ■  .M,/ 
such  that  d  =  OKaja  -  l)k(M)). 

Proof:  For  each  channel  c  £  C,  define  a  set  of  fic¬ 
titious  capacities  cap'(c)  =  cap(c)  -  Ig  n.  The  fat-tree  with 
the  fictitious  capacities  has  a  load  factor  k'(M)  ^ 
{a/ a  —  l)A(Af).  Now  use  the  scheduling  algorithm  of  Theo¬ 
rem  1,  but  during  the  recursion  on  lower  levels  of  the  tree, 
rather  than  using  new  message  sets,  simply  reuse  the  2k' (M) 
message  sets  produced  by  partitioning  the  messages  through 
the  root. 

The  bisections  at  a  given  level  produce  partitions  of  the  set 
of  messages  that  are  equal  to  within  one,  and  this  error  can 
accumulate  in  a  single  channel  as  we  go  down  the  tree.  The 
largest  value  of  the  error  can  be  as  much  as  Ig  n,  but  the  actual 
capacities  are  never  exceeded,  and  so  each  of  the  2k' (M) 
message  sets  will  be  routable  in  one  delivery  cycle.  ■ 

Thus,  for  example,  if  the  capacities  are  each  at  least  2  Ig  n, 
the  number  of  delivery  cycles  is  not  worse  than  4k{M).  (In 
fact,  the  divide-and-conquer  partitioning  of  messages  can  be 
improved  to  2A(Af)  +  o(A(A/)).) 

IV.  The  Hardware  Requirements  of  Fat-Trees 

This  section  investigates  the  amount  of  hardware  required 
by  a  fat-tree.  We  give  a  precise  description  of  how  the 
switches  in  the  nodes  of  a  fat-tree  might  be  implemented  and 
determine  how  much  hardware  a  node  requires.  We  then 
define  the  channel  capacities  of  universal  fat-trees.  Finally, 
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we  determine  the  amount  of  hardware  required  to  build  uni¬ 
versal  fat-trees. 

The  model  for  hardware  that  we  use  is  an  extension  of 
Thompson’s  two-dimensional  VLSI  model  [29]  by  making 
the  natural  extension  to  three  dimensions.  In  this  model, 
wires  occupy  volume  and  have  a  minimum  cross-sectional 
area.  Similar  three-dimensional  models  have  been  studied  by 
Rosenberg  [26]  and  Leighton  and  Rosenberg  [16]. 

We  first  present  an  implementation  of  a  fat-tree  node.  As 
was  shown  in  Fig.  3,  most  of  the  switching  components  are 
contained  in  the  three  concentrator  switches.  According  to 
the  three-dimensional  VLSI  model,  however,  we  must  also 
be  concerned  with  the  amount  of  wire  consumed  by  the  inter¬ 
connection  of  the  components.  We  shall  show  that  a  fat-tree 
node  with  m  incident  wires  can  be  built  with  0(m)  com¬ 
ponents  in  a  box  whose  side  lengths  are  0(h  Vm),  0{h  Vm ), 
and  0(y/m/h),  for  any  1  £  >i  s  Vm.  The  node  requires 
constant  time  to  route  its  inputs. 

We  shall  need  some  definitions.  An  (r,  5)  concentrator 
graph  [2 1  ]  is  a  directed  acyclic  graph  with  r  inputs  and  s  ^  r 
outputs  such  that  any  k  ^  s  inputs  can  be  simultaneously 
connected  to  some  k  outputs  by  vertex-disjoint  paths.  An 
(r,s,a)  partial  concentrator  graph  is  a  directed  acyclic 
graph  with  r  inputs  and  s  ^  r  outputs  and  a  constant 
0  <  a  <  1  such  that  any  k  as  inputs  can  be  simultane¬ 
ously  connected  to  some  k  outputs  by  vertex-disjoint  paths. 

Pinsker  [2 1  ]  and  Pippenger  [22]  showed  that  (r,  s)  concen¬ 
trator  networks  can  be  built  with  0(r)  components  using 
probabilistic  constructions,  but  they  do  not  bound  the  depth 
of  the  graph,  which  we  wish  to  be  constant.  Pippenger  [23], 
however,  uses  another  probabilistic  argument  to  construct 
(r,  5,  a)  partial  concentrator  graphs  for  sufficiently  large  r 
where  s  =  2r/3  and  a  =  3/4.  The  partial  concentrator 
graphs  are  bipartite  (no  intermediate  vertices  between  inputs 
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any  permutation  of  m  inputs  and  m  outputs  can  be  routed  in 
O(m’)  area,  which  can  be  seen  by  considering  a  “crossbar" 
layout.  Thus,  in  two  dimensions,  the  wiring  of  the  compo¬ 
nents  and  external  wires  can  be  accomplished  by  laying  all 
components  and  external  wires  along  a  line  and  routing  the 
permutation  dictated  by  the  interconnection. 

The  construction  in  three  dimensions  is  essentially  that  of 
Leighton  and  Rosenberg  [16].  In  three  dimensions,  the  ex¬ 
ternal  wires  and  components  lie  on  a  face  of  a  box.  Any 
permutation  of  m  inputs  and  m  outputs  can  be  routed  in  a  box 
of  0(m^^)  volume  where  each  side  has  length  6>(Vm).  This 
proves  the  result  of  the  theorem  for  constant  h. 

To  extend  the  result,  we  use  a  result  of  Thompson  [29]  on 
converting  a  layout  of  height  h  into  a  layout  of  height  2. 
Consider  slicing  the  box  into  slices  of  height  h,  and  consider 
one  such  slice.  If  we  expand  each  of  the  other  two  dimensions 
by  a  factor  of  h,  the  h  layers  can  be  superimposed,  slightly 
offset  from  one  another.  Since  this  can  be  done  with  each  of 
the  slices  simultaneously,  the  theorem  follows.  ■ 

We  are  now  in  a  position  to  ascertain  the  cost  of  a  fat-tree 
implementation  based  on  the  capacities  of  its  channels.  If  the 
capacities  of  the  fat-tree  channels  are  determined  arbitrarily, 
the  analysis  could  be  messy.  For  the  fat-trees  that  will  be  used 
in  universality  results  of  Section  VI,  however,  the  channel 
capacities  can  be  characterized  by  the  capacity  at  the  root. 
This  section  defines  the  channel  capacities  of  a  universal 
fat-tree  and  evaluates  the  hardware  costs  of  an  imple¬ 
mentation.  Without  loss  of  generality,  and  for  simplicity,  we 
assume  in  this  section  that  the  number  of  connections  to  each 
processor  in  the  fat-tree  is  I . 

Let  FT  be  a  fat-tree  on  n  processors,  and  let  C  be  the  set 
of  channels  in  FT.  Consider  each  node  to  have  a  level  number 
that  is  its  distance  to  the  root,  and  give  each  channel  c  G  C 
the  same  level  number  as  the  node  beneath  it.  Thus,  for 


and  outputs),  every  input  has  degree  at  most  6,  and  every  example,  the  root  and  the  channel  between  the  root  and  the 

output  has  degree  at  most  9.  By  pasting  several  of  these  external  interface  are  both  at  level  0.  The  processors  and  the 

graphs  together,  outputs  to  inputs,  any  constant  ratio  of  con-  channels  leaving  them  are  at  level  Ig  n.  If  the  channel  at  level 
centration  can  be  obtained  in  constant  depth.  For  a  given  set  0  has  capacity  w.  then  we  say  that  FT  has  root  capacity  w. 

of  inputs,  the  paths  through  the  graph  can  be  set  up  in  poly-  Definition:  Let  FT  be  a  fat-tree  on  n  processors  with  root 

nomial  time  using  network  flow  techniques  or  by  performing  capacity  w  where  s  w  s  n.  Then  if  each  channel  c  G  C 
a  sequence  of  matchings  on  each  level  of  the  graph.  at  level  k  satisfies 

We  use  a  partial  concentrator  graph  to  construct  a  good 
concentrator  switch.  We  simply  make  switching  decisions  at  cap(c)  =  min 

the  inputs  to  each  level.  These  decision  bits  can  be  inter¬ 
leaved  with  the  address  bits  that  specify  the  path  of  a  message  we  call  FT  a  universal  fat-tree. 

through  the  fat-tree.  In  order  to  use  the  off-line  routing  results  The  capacities  of  the  channels  of  a  universal  fat-tree  grow 


from  Section  III,  we  treat  the  actual  capacity  of  a  channel  as 
a  times  the  number  of  wires,  which  changes  the  results  by 
only  a  constant  factor. 

We  now  turn  our  attention  to  the  physical  structure  of  a 
fat-tree  node.  A  node  with  m  incident  wires  contains  0(m) 
components.  The  next  theorem  gives  the  physical  volume 
necessary  to  wire  the  components. 

Lemma  3:  A  set  of  m  components  and  external  wires  can 
be  wired  together  according  to  an  arbitrary  interconnection 
pattern  to  fit  in  a  whose  side  lengths  are  0(h's/m), 
0(h''/m).  and  0(  \  m /h),  for  any  I  ^  h  s  Vm. 

Proof:  We  need  to  use  the  fact  that  in  two  dimensions. 


exponentially  as  we  go  up  the  tree  from  the  leaves.  Initially, 
the  capacities  double  from  one  level  to  the  next,  but  at  levels 
closer  than  3  lg(n/vv)  to  the  root,  the  channel  capacities  grow 
at  the  rate  of  V4. 

We  can  now  determine  the  hardware  required  by  a  univer¬ 
sal  fat-tree. 

Theorem  4:  Let  FT  be  a  universal  fat-tree  on  n  processors 
with  root  capacity  w  where  n'  ^  ^  w  s.  n.  Then  there  is  an 
implementation  of  FT  in  a  cube  of  volume  v  =  0((w  lg(n/ 
w))^^)  with  Ofn  lg(w-'/n'))  components. 

Proof:  We  first  establish  the  component  count.  For  a 
node  at  level  k  ^  3  lg(/i/vv).  the  number  of  components  in 
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the  node  is  and  the  number  of  nodes  at  level  k  is 

2*.  Thus,  the  number  of  components  in  all  levels  between  0 
and  3  lg(n/w)  is 

2  2*0(*v/2“'')  =  vv  2  <5(2*'^) 

fO  »-0 

=  0{n) 

since  the  largest  term  of  the  geometric  series  occurs  when 
k  =  3  lg(n/w).  Nearer  the  leaves,  each  level  has  about  the 
same  number  of  components.  The  total  number  in  the  levels 
between  3  lg(n/w)  and  Ig  n  is 

2  2*0(/i/2*)  =  0{n  Ig(wVn^))  ■ 

Thus,  the  number  of  components  nearer  the  leaves  of  the 
fat-tree  dominates. 

The  volume  bound  is  somewhat  more  intricate  to  establish, 
but  is  essentially  the  unrestricted  three-dimensional  layout 
construction  given  by  Leighton  and  Rosenberg  [16].  The 
interested  reader  is  referred  to  their  paper.  Similar  divide- 
and-conquer  layout  strategies  for  two  dimensions  can  be 
found  in  [3],  [12],  [14],  [17],  [18],  [32].  ■ 

Theorem  4  gives  the  volume  of  a  fat-tree  in  terms  of  its 
root  capacity.  For  the  universality  results  of  Section  VI,  we 
shall  be  interested  in  the  reverse. 

Definition:  Let  FT  be  a  universal  fat-tree  that  occupies 
volume  V  and  has  root  capacity  Q(,v'-'^/\g(n/v-^)).  Then  FT  is 
a  universal  fat-tree  of  volume  v. 

Remark:  A  universal  fat-tree  on  n  processors  of  volume  v 
must  satisfy  u  =  LKn  Ig  n)  and  v  =  0(n’'^)  to  be  well  de¬ 
fined.  By  modifying  the  definition  of  a  universal  fat-tree,  the 
lower  bound  can  be  relaxed  to  Cl(n),  which  results  in  minor 
changes  to  the  bounds  quoted  in  the  universality  theorem  of 
Section  VI. 

V.  Decomposition  Trees 

The  physical  implementation  of  a  routing  network  con¬ 
strains  the  ability  of  processors  in  a  parallel  supercomputer  to 
communicate  with  one  another.  The  universality  theorem 
from  Section  VI  makes  essentially  one  assumption  about 
competing  networks:  at  most  0(a)  bits  can  pass  through  a 
surface  of  area  a  in  unit  time.  This  assumption  can  be  brought 
to  bear  on  an  arbitrary  portion  of  a  routing  network  im¬ 
plementation  through  the  use  of  decomposition  trees,  a  re¬ 
finement  of  the  graph-theoref’c  notion  of  separators  [19j. 
Similar  results  can  be  found  in  the  VLSI  theory  literature. 
The  results  presented  here  generalize  and  greatly  simplify 
some  of  the  constructions  in  the  literature,  notably  those  in 
[3],  [4],  and  [13].  The  generalizations  arc  necessary  for  the 
proof  of  the  universality  theorem. 

A  routing  network  R  interconnecting  a  set  P  of  processors 
has  a  [wo.w,,  •  •  • ,  w,]  decomposition  tree  if  the  amount  of 
information  that  can  enter  or  leave  the  set  P  of  processors 
from  the  outside  world  is  at  most  wq  bits  per  unit  time;  P  can 
be  partitioned  into  two  sets  Po  and  P,  such  that  the  amount  of 
information  that  can  enter  or  leave  each  set  is  at  most  tv,  bits 


per  unit  time;  each  of  Pq  and  F,  can  be  partitioned  into  two 
sets  such  that  the  bandwidth  to  and  from  each  of  the  four  sets 
is  at  most  W2;  and  so  on,  until  every  set  at  the  rth  level  has 
either  zero  or  one  processors  in  it.  When  the  bandwidth  de¬ 
creases  by  a  constant  amount  from  one  level  to  the  next,  we 
shall  adopt  a  shorthand  notation.  We  shall  say  that  R  has  a 
(tv,  a)  decomposition  tree  for  I  <  a  s  2  if  it  has  a 
[w,  w/a,  w/a^,  -  •  ■ ,  0(1)]  decomposition  tree.  (For  VLSI 
graph  layouts,  there  is  a  similar  notion  called  bifurcators 
[3],  [13].) 

Theorem  5:  Let  R  be  a  routing  network  that  occupies  a 
cube  of  volume  v.  Then  R  has  an  (0(u’'^J,  'VJ^)  decomposition 
tree. 

Proof:  The  cube  has  side  length  and  surface  area 
Imagine  a  rectilinearly  oriented  plane  that  splits  the 
cube  into  two  equal  boxes,  each  occupying  volume  u/2.  This 
cutting  plane  naturally  partitions  the  processors  into  two  sets. 
Partition  each  of  the  two  boxes  by  repeating  this  procedure 
with  a  plane  perpendicular  to  the  first.  Continuing  now  in  the 
third  dimension  yields  eight  cubes.  Repeat  this  procedure 
until  each  box  contains  either  zero  or  one  processors. 

The  volume  of  each  of  the  2'  boxes  generated  by  the  ith  cut 
is  v/2‘,  and  the  surface  area  is  at  most  4‘'y4  (u/2')^^  Let  y  be 
the  constant  factor  by  which  the  bandwidth  of  information 
transfer  differs  from  the  surface  area.  Then  the  routing  net¬ 
work  R  has  a  (4"^  ‘'J^)  decomposition  tree.  ■ 

A  decomposition  tree  generated  by  the  cutting  plane  meth¬ 
od  can  be  unbalanced  in  the  sense  that  the  number  of  pro¬ 
cessors  lying  on  either  side  of  a  given  cut  may  be  unequal. 
Following  the  approach  of  Bhatt  and  Leighton  [3],  we  define 
a  balanced  decomposition  tree  to  be  a  decomposition  tree  in 
which  the  number  of  processors  on  either  side  of  a  given 
partition  is  equal,  to  within  one.  We  shall  show  that  a  bal¬ 
anced  decomposition  tree  can  be  produced  from  an  unbal¬ 
anced  one. 

First,  however,  we  shall  need  two  combinatorial  lemmas. 
The  first,  which  deals  with  the  partitioning  of  strings  of 
pearls,  is  typical  of  lemmas  proved  in  the  VLSI  theory 
literature. 

Lemma  6:  Consider  any  two  strings  composed  of  even 
numbers  of  black  and  white  pearls.  By  making  at  most  two 
cuts,  the  pearls  can  be  divided  into  two  sets,  each  containing 
at  most  two  strings,  such  that  each  set  has  exactly  half  the 
pearls  of  each  color. 

Proof Call  the  strings  L  and  S  for  “long"  and  “short.^ 
We  use  a  continuity  argument  to  show  that  two  sets  A  and  A 
satisfying  the  conditions  of  the  lemma  can  always  be  pro¬ 
duced.  Place  the  strings  L  and  S  end-to-end  in  a  circle,  as  is 
illustrated  in  Fig.  4(a).  Let  A  be  the  set  of  pearls_comprising 
the  shaded  half  of  the  circle  in  Fig.  4(b),  and  let  A  be  the  set 
of  pearls  in  the  other  half.  Suppose  without  loss  of  generality 
that  the  set  A  contains  too  many  black  pearls  and  set  A 
contains  too  few.  We  shall  show  how  to  transform  set  A  so 
that  it  occupies  the  initial  position  of  set  A.  The  trans¬ 
formation  consists  of  a  sequence  of  moves  such  that  for  each 

^Thanks  to  G.  Miller  of  USC.  who  provided  this  argument,  which  is  simpler 
than  our  original  algebraic  proof. 
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Fig.  4.  The  partitioning  argument,  (a)  The  two  strings  i  and  5  laid  end-to-end 
in  a  circle,  (b)  The  initial  position  of  set  A.  (c)-(f)  The  transformation  of  A 
into  A . 

move,  the  number  of  blacks  within  set  A  changes  by  at  most 
one.  Since  set  A  starts  out  with  too  many  black  pearls  and 
ends  with  too  few,  by  continuity  there  will  be  a  position  in  the 
middle  where  A  has  exactly  half  the  black  pearls.  Further¬ 
more,  because  A  has  half  the  total  number  of  pearls,  it  will 
also  have  half  the  white  pearls. 

The  transformation  begins  by  rotating  set  A  counter¬ 
clockwise,  as  shown  in  Fig.  4(c),  until  it  reaches  the  position 
shown  in  Fig.  4(d).  Then,  set  A  is  broken  into  two  pieces  and 
the  tailing  piece  is  rotated  clockwise  until  it  meets  up  with  the 
leading  piece  on  the  other  side,  shown  in  Fig.  4(e)  and  (f). 
The  position  of  set  A  is  now  the  initial  position  of  set  A.  As 
can  be  verified,  at  all  times  during  the  transformation,  sets  A 
and  A  each  contain  at  most  two  strings.  ■ 

Lemma  7:  Let  T  be  a  complete  binary  tree  drawn  in  the 
natural  way  with  leaves  on  a  straight  line,  and  consider  any 
string  s  of  k  consecutive  leaves.  Then  there  exists  a  forest  F 
of  complete  binary  subtrees  ofT  such  that  I )  the  leaves  of  F 
are  precisely  the  leaves  in  s,  2)  there  are  at  most  two  trees  of 
any  given  height,  and  3)  the  height  of  the  largest  tree  is  at 
most  Ig  k. 

Proof:  The  forest  is  constructed  from  the  maximal  com¬ 
plete  subtrees  of  T  whose  leaves  lie  only  in  s.  ■ 

Theorem  8:  Let  R  be  a  routing  network  on  n  processors 
that  has  a  [wg.  w,.  ••• ,  w,]  decomposition  tree  T.  Then  R  has 
o  Iw'o,  w’l,  -  •  • ,  wliji„i}  balanced  decomposition  tree  T'  where 

r 

w’  =  4  . 

k»t 

Proof:  Draw  the  decomposition  tree  T  in  the  natural 
way  with  the  T  leaves  on  a  line.  Each  leaf  either  contains  a 
processor  or  else  it  is  empty.  If  the  leaf  contains  a  processor, 
color  it  black;  otherwise,  color  it  white.  Considering  the  line 


of  processors  as  a  string  of  black  and  white  pearls,  as  in 
Lemma  6,  we  can  cut  the  string  in  at  most  two  places  such 
that  the  pearls  are  divided  into  two  sets,  each  containing  at 
most  two  strings,  such  that  each  set  has  exactly  half  the  pearls 
of  each  color.  This  partition  represents  the  first  level  in  the 
balanced  decomposition  tree  T' . 

Recursively  partition  each  of  the  two  sets  using  Lemma  6. 
At  each  step,  the  number  of  black  pearls  (processors)  is  split 
evenly  in  a  set,  and  each  set  contains  at  most  two  strings 
(consecutive  leaves  from  the  decomposition  tree  T).  Thus, 
at  level  Rg  nl.  each  set  (leaf  of  T')  contains  at  most  one 
processor. 

It  remains  to  prove  the  bound  on  the  rates  w,'  of  informa¬ 
tion  transfer  in  and  out  of  each  subtree  of  the  balanced  de¬ 
composition  tree  T' .  Each  subtree  of  T'  corresponds  to  at 
most  two  strings  of  leaves  from  the  original  decomposition 
tree  T.  From  Lemma  7,  these  two  strings  correspond  to  a 
forest  of  complete  binary  trees  with  at  most  four  trees  of  a 
given  height.  All  external  communication  of  a  complete  bi¬ 
nary  subtree  of  a  decomposition  tree  occurs  through  the 
surface  corresponding  to  its  root.  Thus,  the  external  commu¬ 
nication  per  unit  time  of  a  subtree  of  T '  is  bounded  by  the  sum 
of  bandwidths  from  the  roots  of  the  corresponding  complete 
binary  subtrees  of  7.  ■ 

Corollary  9:  Let  R  be  a  routing  network  that  has  alw.a) 
decomposition  tree  for  1  <  a  S  2.  Then  S  has  a 
{Mala  —  l)w,  a)  balanced  decomposition  tree. 

Proof:  The  summation  in  Theorem  8  becomes  a  geo¬ 
metric  series.  ■ 

VI.  Universality  of  Fat-Trees 

We  now  show  that  a  fat-tree  is  universal  for  the  amount  of 
interconnection  hardware  it  requires  in  the  sense  that  any 
other  routing  network  of  the  same  volume  can  be  efficiently 
simulated.  From  a  theoretical  point  of  view,  we  define 
“efficiently”  as  meaning  at  most  polylogarithmic  slowdown. 
Polylogarithmic  time  in  parallel  computation  corresponds  to 
polynomial  time  for  sequential  computation. 

Some  may  argue  that  polylogarithmic  slowdown  may  not 
be  efficient  if  the  exponent  of  the  logarithm  is  large.  The 
ability  of  one  parallel  computer  to  simulate  another,  how¬ 
ever,  merely  gives  confidence  in  the  general-purpose  nature 
of  the  computer.  The  loss  of  efficiency  in  the  simulation  is 
not  felt  if  the  parallel  computer  is  programmed  directly. 

Many  of  the  networks  currently  being  built  are  not  univer¬ 
sal  (for  example,  two-dimensional  arrays,  simple  trees,  or 
multigrids).  These  networks  exhibit  polynomial  slowdown 
when  simulating  other  networks.  Thus,  they  have  no  the¬ 
oretical  advantage  over  a  sequential  computer  which  can 
easily  simulate  a  network  with  polynomial  slowdown.  Inter¬ 
estingly,  hypercube-based  networks  are  universal  for  volume 
0(n’'^),  but  as  we  have  observed,  they  do  not  scale  down  to 
smaller  volumes. 

Theorem  10:  Let  FT  be  a  universal  fat-tree  on  a  set  of  n 
processors  that  occupies  a  cube  of  volume  v,  and  let  R  be  an 
arbitrary  routing  network  on  a  set  of  n  processors  that  also 
occupies  a  cube  of  volume  v.  Then  there  is  an  identification 
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of  the  processors  in  FT  with  the  processors  of  R  with  the 
following  property.  Any  message  set  M  that  can  be  delivered 
in  time  t  by  R  can  be  delivered  by  FT  (off-line)  in  time 
0(t  Ig^  n). 

Proof:  By  Theorem  5,  the  routing  network  R  has  an 
(0(v^^),  ■'^4)  decomposition  tree,  and  hence  by  Corollary  9, 
it  also  has  an  (0(u^’),  "'^4)  balanced  decomposition  tree. 
Identify  the  processors  at  the  leaves  of  the  balanced  decom¬ 
position  tree  of  R  in  the  natural  way,  with  the  processors  at 
the  leaves  of  the  fat-tree  FT. 

By  assumption,  routing  network  R  can  deliver  all  the  mes¬ 
sages  in  a  message  set  Af  in  time  t.  In  unit  time,  at  most 
0(v^^l2^^)  messages  can  enter  or  leave  a  subtree  rooted  at 
level  k  in  ^’s  balanced  decomposition  tree.  Thus,  in  t  time. 


the  total  number  of  messages  that  can  enter  or  leave  a  subtree 
rooted  at  level  k  is  Oitv^^/2^^). 

There  is  a  second  bound  on  the  transfer  of  information  in 
and  out  of  a  subtree  of  the  balanced  decomposition  tree  of  R. 
The  number  of  messages  that  can  enter  or  leave  a  single 
processor  in  time  f  is  OU)  since  the  number  of  connections 
to  a  processor  is  constant.  Since  there  are  at  most  n/2‘  pro¬ 
cessors  in  a  subtree  rooted  at  level  k,  the  total  number  of 
messages  that  can  enter  or  leave  this  subtree  in  t  time  is 
0(m/2*). 

We  now  compute  an  upper  bound  on  the  load  factor  K{M) 
that  Af  puts  on  the  fat-tree  FT.  Let  c  be  a  channel  at  level  k 
in  FT.  We  have  just  seen  that  the  number  load(Af .  c)  of  mes¬ 
sages  of  M  that  must  go  through  c  is  0(tv~^ /2^'"^)  and 
<9(fn/2‘).  Since  FT  is  a  universal  fat-tree  with  root  capacity 
Q(v-^/\g{n/v^^)),  the  capacity  of  c  is 


cap(c)  =  minj  ^ 


Thus,  the  load  factor  on  c  due  to  M  is 

A(Af,c)  =  0{t  lg(n/u^^)) , 
and  the  load  factor  on  the  whole  fat-tree  is 
A(;Vf)  =  0{t  \g{n/v'-'^)) . 

The  off-line  routing  result  from  Theorem  1  says  that 
0(t  lg(n/u"’)  Ig  n)  delivery  cycles  are  sufficient  to  route  all 
the  messages  in  M.  Since  the  fat-tree  can  execute  an  off-line 
delivery  cycle  in  0(lg  n)  time,  the  result  follows.  ■ 

The  0(lg^  n)  factor  lost  in  simulation  is  attributable  to  the 
channel  capacities,  the  routing  algorithm,  and  the  switching. 
Of  these  three,  only  the  last,  the  OOg  n)  switching  time  for 
a  delivery  cycle,  seems  to  be  a  necessary  cost. 

The  first  0(lg  n)  factor  (actually  0(lg(n/i/^’)))  is  because 
a  fat-tree  of  volume  v  has  a  root  capacity  of  0(.v^^/\g{n/v^^)). 
This  logarithmic  factor  vanishes  for  the  simulation  of  net¬ 
works  that  have  only  slightly  less  (0{v/\g^'\n/ u^’)))  volume. 
We  have  chosen  to  put  ail  the  simulation  expense  in  time  so 
that  the  comparison  will  be  equal  hardware  versus  equal 
hardware. 

The  second  0(lg  n)  factor  is  lost  by  the  off-line  routing 
algorithm.  In  fact,  we  have  recently  discovered  [8]  that  off¬ 


line  routing  in  0(k(M)  +  Ig  n  Ig  Ig  n)  delivery  cycles  is 
always  possible.  Moreover,  if  we  assume  that  each  processor 
has  0(lg  n)  connections,  as  is  required  by  a  Boolean  hyper¬ 
cube,  and  each  channel  has  capacity  fidg  n).  Corollary  2 
from  Section  III  allows  us  to  route  in  0(A(A/))  delivery 
cycles. 

An  important  application  of  the  universality  of  fat-trees  is 
to  the  simulation  of  fixed-connection  networks,  that  is,  net¬ 
works  that  have  direct  connections  between  processors.  Here 
we  relax  the  technical  assumption  in  the  definition  of  a  uni¬ 
versal  fat-tree  to  allow  the  processors  to  have  a  given  number 
d  of  connections  to  the  routing  network,  instead  of  1.  Such 
a  universal  fat-tree  of  volume  0{v  \^'}(n/v^^))  on  n  pro¬ 
cessors  can  simulate  an  arbitrary  degree  d  fixed-connection 
network  of  volume  u  on  «  processors  with  only  0(lg  n)  time 
degradation.  The  idea  is  that  the  channel  capacities  of  the 
universal  fat-tree  are  sufficiently  large  that  the  connections 
implied  by  the  network  can  be  represented  as  a  one-cycle 
message  set,  which  requires  0(lg  n)  time  to  be  delivered. 

High-volume  universal  fat-trees  can  be  compared  to  clas¬ 
sical  permutation  networks,  which  all  require  Cl(n^'^)  volume. 
A  universal  fat-tree  on  n  processors  with  0(n^^)  volume  can 
route  an  arbitrary  permutation  off-line  in  time  0(lg  n).  Up  to 
constant  factors,  this  is  the  best  possible  bound  (assuming 
bounded-degree  processors),  but  it  is  also  achievable,  for 
instance,  by  Benes  networks  [2],  [34]  or  by  on-line  sorting 
networks  [I],  [15]. 

A  natural  extension  to  the  off-line  routing  results  presented 
here,  and  indeed,  the  one  that  motivates  the  entire  paper,  is 
the  problem  of  on-line  routing  in  fat-trees.  Not  surprisingly, 
there  are  universal  fat-trees  for  on-line  routing.  In  results  to 
be  reported  elsewhere  [8]  we  have  discovered  a  randomized 
routing  algorithm  that  delivers  all  messages  in  C>(A(M)  + 
Ig  n  Ig  Ig  n)  delivery  cycles  with  high  probability  [8],  but 
the  nodes  of  the  fat-tree  have  somewhat  different  structure 
from  the  design  given  here.  Using  this  result  and  essentially 
the  construction  given  in  this  paper,  one  can  obtain  an  on-line 
analog  to  Theorem  10,  except  with  an  0(lg^  n  Ig  Ig  n)  time 
degradation.  We  anticipate  further  research  will  improve 
this  bound. 

Vll.  Concluding  Remarks 

Universality  has  been  studied  more  generally  in  the  paral¬ 
lel  computation  literature.  Valiant  [33]  and  Valiant  and 
Brebner  [31]  have  discovered  universal  routing  schemes  for 
large-volume  networks.  Galil  and  Paul  [7]  have  proposed  a 
general-purpose  parallel  processor  based  on  the  cube- 
connected-cycles  network  [25]  that  can  simulate  any  other 
parallel  processor  with  only  a  logarithmic  loss  in  efficiency. 
Valiant  [30]  has  shown  that  there  are  classes  of  universal 
Boolean  circuits.  A  universal  circuit  of  a  given  size  can  be 
programmed  to  simulate  any  circuit  whose  size  is  only 
slightly  smaller.  Fiat  and  Shamir  [6]  have  proposed  a  uni¬ 
versal  architecture  for  systolic  array  interconnections. 

Universal  fat-trees  are  parameterized  not  only  in  the  num¬ 
ber  of  processors,  but  also  in  volume,  which  is  indirectly  a 
measure  of  communication  potential.  By  considering  arbi- 
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trary  networks  in  terms  of  these  two  parameters,  we  have 
seen  that  the  one  fat-tree  architecture  is  near-optimal 
throughout  the  entire  range  of  the  parameters.  For 
communication-limited  engineering  situations,  one  need  not 
necessarily  retreat  to  special-purpose  devices  in  order  to 
compute  efficiently  in  parallel. 

Fat-trees  have  the  advantage  that  they  are  a  robust  engi¬ 
neering  structure.  In  principle,  one  need  not  worry  about  the 
exact  capacities  of  channels  as  long  as  the  capacities  exhibit 
reasonable  growth  as  we  go  up  the  tree.  As  a  practical  matter, 
one  should  build  the  biggest  fat-tree  that  one  can  afford,  and 
the  architecture  automatically  ensures  that  communication 
bandwidth  is  effectively  utilized.  Another  feature  of  fat-trees 
is  that  algorithms  are  the  same  no  matter  how  big  the  fat-tree 
is.  Code  is  portable  in  that  it  can  be  moved  between  an 
inexpensive  computer  and  a  more  expensive  one.  Finally, 
the  root  channel  offers  a  natural  high-bandwidth  external 
connection. 

Although  universal  fat-trees  have  many  desirable  proper¬ 
ties,  there  are  many  issues  in  the  design  of  a  routing  network 
that  we  have  not  faced  directly.  For  example,  despite  our 
concern  for  wirability,  we  have  not  presented  a  practical 
packaging  scheme.  Possibly,  the  packaging  techniques  for 
trees  from  [4]  and  [18]  can  be  exploited.  The  constraints  to  be 
faced  in  packaging,  however,  will  only  be  more  stringent 
than  the  surface  area  constraint  given  in  Assumption  L3.  We 
have  attempted  to  deal  with  “pin  boundedness”  in  a  simple 
mathematical  model,  and  our  results  should  generalize  to 
more  complicated  packaging  models. 

Another  issue  that  we  have  not  addressed  is  how  messages 
should  be  sent  in  the  network.  The  choice  of  the  bit-serial 
approach  in  Section  li  has  the  advantage  that  the  hardware  is 
cheaper,  but  we  may  be  paying  in  the  performance  of  the 
routing  algorithm.  We  also  assumed  the  architecture  was 
synchronized  by  delivery  cycle.  Presumably,  fat-tree  archi¬ 
tectures  can  be  built  with  different  design  decisions. 

Whether  the  notion  of  universal  parallel  supercomputers  is 
consistent  with  engineering  reality,  however,  remains  an 
open  question.  Independent  of  routing  network  issues,  there 
are  many  other  problems  that  must  be  solved  if  abstract 
n-processor  parallel  supercomputers  are  to  become  a  reality. 
For  example,  problems  of  maintenance,  fault  tolerance, 
clock  distribution,  and  reliable  power  supply  must  be  solved. 
The  hardware  mechanisms  needed  for  synchronization  and 
instruction  distribution,  which  are  simple  for  single¬ 
processor  machines,  may  be  sufficiently  complicated  to  over¬ 
whelm  the  advantages  of  having  many  processors. 

But  the  largest  problem  that  must  be  solved  in  parallel 
supercomputing  seems  to  us  to  be  the  problem  of  pro¬ 
gramming  the  system  with  the  concerns  of  both  programming 
abstraction  and  algorithmic  integrity  (computational  re¬ 
sources  are  not  free).  A  supercomputer  should  not  be  a  mere 
supercalculator  (good  at  one  restricted  algorithm).  It  should 
have  the  power  to  efficiently  execute  many  different  parallel 
algorithms  and  to  easily  combine  the  results  of  separate  par¬ 
allel  computations.  A  universal  machine  has  the  power,  not 
just  of  any  omer  machine,  but  of  all  other  machines. 


Acknowledgment 

T.  Knight  of  the  M.l.T.  Artificial  Intelligence  Laboratory 
inspired  the  author  with  many  discussions  about  connection 
machine  architectures,  which  led  him  towards  the  ideas  in 
this  paper.  He  is  to  blame  for  the  name  “fat-tree.”  Thanks  to 
R.  Greenberg  of  the  M.l.T.  Laboratory  for  Computer  Science 
for  finding  many  bugs  in  various  versions  of  the  paper. 
Thanks  also  to  A.  Bawden,  T.  Bui,  B.  Chor,  T.  Cormen,  O. 
Goldreich,  A.  Ishii,  T.  Leighton,  B.  Maggs,  M.  Maley,  G. 
Miller,  C.  Phillips,  R.  Rivest,  and  P.  Shor  for  their  helpful 
comments  and  technical  assistance. 


References  . 

[I]  M.  Ajtai,  J.  Komlos,  and  E.  Szemeredi,  "Soning  in  c  log  n  parallel 
steps"  Combinatorica,  vol.  3,  no.  1,  pp.  1-19,  1983. 

(21  V.  E.  Benes,  Mathematical  Theory  of  Connecting  Networks  and  Tele¬ 
phone  Traffic.  New  York;  Academic.  1965. 

(31  S.N.  Bhati  and  F.  T.  Leighton,  "A  framework  for  solving  VLSI  graph 
layout  problems,"  J ■  Comput.  Syst.  Sci.,  vol.  28,  no.  2,  pp.  300-343. 
Apr.  1984. 

(41  S.N.  Bhatt  and  C.E.  Leiserson,  "How  to  assemble  tree  machines."  in 
Advances  in  Computing  Research.  Vol.  2.  Greenwich,  CT;  Jai  Press. 
1984,  pp.  95-114. 

(51  S.  Fahiman,  NETL:  A  System  for  Representing  and  Using  Real-World 
Knowledge.  Cambridge,  MA:  M.l.T.  Press.  1979. 

(61  A.  Fiat  and  A.  Shamir,  "Polymorphic  arrays;  A  novel  VLSI  layout  for 
systolic  computers."  in  Proc.  IEEE  25th  Anna.  Symp.  Foundations  Com¬ 
put.  Sci.,  Oct.  1984,  pp.  37-45. 

(71  2.  Galil  and  W.  J.  Paul,  "An  efficient  general-purpose  parallel  com¬ 
puter."  J.  ACM,  vol.  30.  no.  2.  pp.  360-387,  Apr.  1983. 

(81  R.  I.  Greenberg  and  C.  E.  Leiserson,  "Randomized  routing  on  fat-trees." 
in  Proc.  IEEE  26th  Annu.  Symp,  Foundations  Comput.  Sci.,  Nov.  1985. 
to  appear. 

(91  W.  D.  Hillis,  "The  connection  machine,  "  Arxif.  Intell.  Lab..  Mass.  Inst. 
Technol..  Tech.  Memo.  646,  Sept.  1981. 

(lOl  R.  M.  Karp,  F.  T.  Leighton.  R.  L.  Rivest,  C.  D.  Thompson,  U.  Vazirani. 
and  V.  Vazirani,  "Global  wire  routing  in  two-dimensional  arrays,"  in 
Proc.  IEEE  24th  Annu.  Svmp.  Foundations  Comput.  Sci..  Nov.  1983. 
pp.  453-459. 

(HI  T.  F.  Knight.  'The  cross-omega  router.  "  Mass.  Inst.  Technol.,  Cam¬ 
bridge.  unpublished  manuscript.  1984. 

(121  F.  T.  Leighton.  "New  lower  bound  techniques  for  VLSI."  Math.  Syst. 
Theory,  vol.  17,  no.  1.  pp.  47-70,  Apr.  1984. 

(131  - .  "A  layout  strategy  for  VLSI  which  is  provably  good."  in  Proc.  I4th 

Annu.  ACM  Symp.  Theory  Comput..  May  1982,  pp.  85-98. 

(141  - .  Complexitv  Issues  in  VLSI.  Cambridge,  MA:  M.l.T.  Press. 

1983. 

(151  - ■  “Tight  bounds  on  the  complexity  of  parallel  sorting, "  IEEE  Trans. 

Comput..  vol.  C-34,  pp.  344-354.  Apr.  1985. 

(16]  F.  T.  Leighton  and  .\.  L.  Rosenberg.  "TTiree  dimensional  circuit  layouts," 
Lab  Comput.  Sci..  Mass.  Inst.  Technol.,  Tech.  Rep.  MIT-LCS-TM- 
262.  lune  1984. 

(171  C.  E.  Leiserson,  "Area-efficient  graph  layouts  (for  VLSI)."  in  Proc.  IEEE 
2 1  St  Annu.  Symp.  Foundations  Comput.  Sci.,  Oct.  1980,  pp.  270-281. 

(181  - ,  Area-Efficient  VLSI  Computation.  Cambridge,  MA:  M.l.T. 

Press,  1983. 

(191  R.  J.  Lipton  and  R.  E.  Tarjan,  "A  planar  separator  theorem,"  SIAM  J. 
Appl.  Math.,  vol.  36.  no.  2,  pp.  177-189,  Apr.  1979. 

(201  C.  A.  Mead  and  L.  A.  Conway,  Introduction  to  VLSI  Sys¬ 
tems.  Reading,  MA;  Addison-Wesley.  1980. 

(211  M.  S.  Pinsker,  "On  the  complexity  of  a  concentrator."  in  Proc.  7th  Int. 
Teletraffic  Conf.,  Stockholm.  Sweden.  June  1983,  pp.  318/1-318/4. 

(22)  N.  J.  Pippenger.  "The  complexity  theory  of  switching  networks,"  Ph.D. 
dissertation,  Dep.  Elec.  Eng.  Comput.  Sci.,  Mass.  Inst.  Technol..  Cam- 
bndge,  Aug.  1973. 

(23)  - ,  "Superconcentrators."  S/AAf  J.  Comput..  vol.  6,  no.  2.  pp.  298- 

304,  June  1977. 


LE1SERSON-.  Fat-trees 


901 


[24]  - ,  “Parallel  communication  with  limited  bul'fers,"  in  Proc.  IEEE 25th 

Annu.  Symp.  Foundations  Compui.  Sci.,  Oct.  1984,  pp.  127-136. 

[25]  F.  P.  Pieparata  and  J.  E.  Vuillemin.  "The  cube-connected  cycles:  A  ver¬ 
satile  network  for  parallel  computation,"  Com/nun.  ACA/,  vol.  24.no.  5. 
pp.  300-309,  May  1981. 

[26]  A.  L.  Rosenberg.  "Three-dimensional  integrated  circuitry,"  in  Proc. 

CMU  Conf.  VLSI  Svst.  Comput..  H.  T.  Kung,  R.  Sproull,  and  O.  Steele. 

Eds..  Oct.  1981.  pp.  69-79. 

[27]  J.T.  Schwanz,  "Ultracomputers."  ACM  Trans.  Programming  Lang. 

Sysl..  vol.  2.  no.  4,  pp.  484-521.  1980. 

[28]  H.  S.  Stone.  “Parallel  processing  with  the  perfect  shuffle."  IEEE  Trans. 

Comput..  vol.  C-20.  pp.  153-161,  Feb.  1971.  computation.  VLSI  architectures,  graph  theory,  digital  circuit  timing,  analysis 

[29]  C.  D.  Thompson,  “A  complexity  theory  for  VLSI,"  Ph.D.  dissertation.  of  algorithms,  computer-aided  design,  placement  and  routing,  wafer-scale 

Camegie-Mellon  Univ.,  Pittsburgh,  PA.  1980.  ‘  integration,  layout  compaction,  and  most  recently,  parallel  supercomputing. 

[30]  L.G.  Valiant.  “Universal  circuits,"  in  Proc.  8th  Annu.  ACM  Symp.  His  principal  interest  is  in  the  theoretical  foundation  of  parallel  computation. 

Theory  Comput..  May  1976,  pp.  196-203.  especially  as  it  relates  to  engineenng  reality. 

[31]  L.G.  Valiant  and  G.  J.  Brebner.  “Universal  schemes  for  parallel  commu-  Prof.  Leiserson  has  authored  over  20  papers  on  the  theory  of  VLSI  and 

nication."  in  Proc.  13th  Annu.  ACM  Symp.  Theory  Comput..  May  1981 ,  parallel  algorithms.  As  a  graduate  student  at  Camegie-Mellon  he  wrote  the  first 

pp.  263-277.  paper  on  systolic  architectures  with  H.  T.  Kung,  for  which  they  received  a  U.S. 

[32]  L.G.  Valiant.  “Universality  considerations  in  VLSI  circuits,"  IEEE  patent.  His  Ph.D.  dissertation.  “Area-Efficient  VLSI  Computation,"  which 

Trans.  Comput..  vol.  C-30,  pp.  135-140.  Feb.  1981.  deals  with  the  design  of  systolic  systems  and  with  the  problem  of  determining 

[33]  - .  “A  scheme  for  fast  parallel  communication."  SIAM  J.  Comput..  the  VLSI  area  of  a  graph,  won  the  first  ACM  Doctoral  Dissertation  Award.  He 

vol.  II,  no.  2,  pp.  350-361,  May  1982.  was  awarded  a  Presidential  Young  Investigator  Award  in  1985.  He  is  a  member 

[34]  A.  Waksman,  “A  permutation  network,"  J.  ACM.  vol.  15,  no.  I.  of  the  ACM.  and  he  serves  on  the  ACM  General  Technical  Achievement 

pp.  159-163,  Jan.  1968.  Award  Committee,  which  selects  the  Turing  Award  winner. 


Charles  E.  Leiserson  (M'83)  received  the  B.S.  de¬ 
gree  in  computer  science  and  mathematics  from  Yale 
University,  New  Haven,  CT.  in  1975  and  the  Ph  D. 
degree  in  computer  science  from  Camegie-Mellon 
University.  Pittsburgh.  PA.  in  1981. 

He  is  currently  an  Associate  Professor  of  Com¬ 
puter  Science  and  Engineenng  at  the  Massachusetts 
Institute  of  Technology  (M.l.T),  Cambridge.  In 
1981  he  joined  the  faculty  of  the  Theory  of  Com¬ 
putation  Group  in  the  M.l.T.  Laboratory  for  Com¬ 
puter  Science.  His  expertise  includes  parallel 


■ «  X'P  n  II 1 1 »  ^>>B .  uuuij  UIJ  la 


MASSACHUSETTS  INSTITUTE 


OF  TECHNOLOGY 


VLSI  PUBLICATIONS 


VLSI  Memo  No.  85-278 
October  1985 


SIGNAL  DELAY  IN  LEAKY  RC  MESH  MODELS  FOR  BIPOLAR  INTERCONNECT 


Peter  O'Brien  and  John  L.  Wyatt,  Jr. 


Abstract 


This  paper  extends  the  results  of  Penfield  and  Rubinstein  on  signal  delay  in 
RC  tree  netvorks.  Both  estimates  and  bounds  are  derived  for  the  step  response 
of  "leaky"  RC  trees  and  meshes,  a  class  of  networks  that  is  appropriate  for 
modelling  interconnect  in  digital  bipolar  circuits.  This  paper  is  intended  to 
serve  as  a  tutorial  as  veil  as  a  research  report.  Therefore  existing  vork  is 
explained  in  some  detail  along  with  the  derivation  of  new  results. 


Microsvstems 
Researcn  Center 
Room  39-321 


Massachusetts 
Institute 
of  Technology 


Cambridge 

Massachusetts 

02139 


Teleohone 
(617)  253-3138 


lJu  I  mi  »J  ■  'iM 


SIGSAI,  DEXAr  IN  EO.  INTERCONNECT 


Peter  R.  O'Brlea,  J.  L.  Wyatt,  Jr. 


Department  of  Electrical  Engineering  and  Computer  Science, 
Massachusetts  Institute  of  Technology,  Cambridge,  MA  02139 


Abstract 

Rubinstein  et.  al.  [1]  have  derived  rigor¬ 
ous,  closed-form  bounds  on  Che  step  response  of 
linear,  noouniform  RC  trees.  Wyatt  [2]  has  ext¬ 
ended  these  bounds  to  include  RC  meshes.  These 
results  have  proven  to  be  useful  for  fast  estim¬ 
ation  and  bounding  of  interconnect  delay  in  dig¬ 
ital  MOS  integrated  circuits  [3-5].  We  further 
extend  these  results  Co  include  leaky  RC  trees 
and  meshes,  a  class  of  networks  chat  is  approp¬ 
riate  for  modelling  interconnect  in  digital  bi¬ 
polar  circuits.  Finally,  we  discuss  two  prac¬ 
tical  problems  Involved  in  using  these  results: 
1.)  Extra  computational  requirements.  2.)  Accur¬ 
ate  modelling  of  digital  bipolar  gates. 

I.  Introduction 


Isolate  Che  resistor  subnetwork  R  containing 
all  Che  resistors  and  assign  reference  directions 
to  Che  capacitor  currents  as  shown  in  Fig.  2.  Let 
Che  node  connected  to  the  independent  source  serve 
as  Che  datum  node  of  !J.  With  Che  datum  grounded. 
Che  node  voltages  are  expressed  in  terms  of  the 
capacitor  currents  by  the  positive-definite,  sym¬ 
metric  matrix  R  as  shown  below. 


(1) 


When  e(c)  is  not  zero,  then  by  superposition 
we  have : 


''l 

^1  ••• 

«1N 

^1 

^.1  •• 

V 

A  leaky  RC  tree  or  mesh  meets  all  of  the  re- 
quiremeucs  of  an  RC  tree  or  mesh  (as  described 
in  [1]  and  [2]),  except  for  the  restriction  that 
there  be  no  grounded  resistors.  The  practical 
nocivation  behind  allowing  resistive  paths  to 
ground  in  RC  mesh  circuits  is  an  attempt  to  more 
accurately  model  interconnect  on  bipolar  chips. 
The  base  of  a  bipolar  transistor  loading  the  in¬ 
terconnect  offers  a  (nonlinear)  resistive  path  to 
ground.  A  corresponding  gate  electrode  of  an  MOS 
transistor  would  not  offer  such  a  resistive  path 
to  ground,  and  previous  works  (e.g.,  [ll,[2])  re¬ 
lied  on  the  abscence  of  such  paths. 

This  paper  is  concerned  with  bounds  on  the 
step  response  of  a  linear,  lumped,  leaky  RC  tree 
or  mesh  driven  by  an  ideal  voltage  source,  and 
starting  from  some  arbitrary  (possibly  non-zero) 
initial  eauilibrium  condition. 

II.  Hetvark  Differential  Equations  for  RC  Mes’nes 

«x  ®  S 


'■1  G 


f  itii 


Lecsv  RC  mesh. 


where  x  •  (x^. 


v(t)  •  Rl(t)  +  xe(t) 
T 


(2) 


is  a  column  vector  of 


dimensionless  numbers,  each  x^  being  numerically 

equal  to  the  potential  (in  volts)  produced  at  node 
i  due  solely  to  a  l-voit  source  at  the  datum, while 
open-circuiting  all  of  the  external  current  sources. 

Consider  the  response  to  a  step  change  in  e(c). 
Substituting  i,^  •  -C,  v,  into  (2) ,  and  identifying 


eq 


Xj^e(”)  ■  Xj^a(t)  as  t.he  final  equilibrium 


voltage  et  node  i,  yields  the  network  differential 
equations 

V.  -v,(t)  R..c.v.(t):  i.l . N:t>0 


which  are  identical  in  form  to  eq.  (2)  of  [2]: 
the  only  differences  are  that  in  [2]  v 


(3) 


1  for 


all  nodes  i,  and  the  resistance  matrix  R”(Rj^^) 

does  not  reflect  the  presence  of  any  resistive 
paths  to  ground. 


Fig.  2:  The  resiscor  subnecuork  R  excracced  fron 
Che  circuit  ia  Fig.  1  (driven  with  current  sources) 


Given  Facts  1  and  2,  Fact  3  follows  from  the 
network  differential  equations  (3).  See  [7]  for  a 
detailed  derivation. 

We  now  construct  a  reduced-order  model,  as  in 
[6].  Choosing  a  distinguished  node  1  as  the  out¬ 
put  node  of  loterest,  we  describe  Che  system  in 
terms  of  only  two  state  variables,  the  "normalized 
distance  Co  go" 


and  its  Integral 

•  J  gj^(t')dt’ 


III.  Optimal  Control  Method  for  Determining 
Step  Response  Bounds 

The  derivation  outlined  below  is  similar  to 
chat  in  [6],  but  the  results  now  apply  to  leaky 
meshes  starting  from  an  arbitrary  (possibly  non¬ 
zero)  equilibrium  condition.  We  shall  use  Che 
notation  v  to  denote  the  initial  equilibrium  vol- 

caee  distribution,  and  v  to  denote  the  final 
— 5 -  ,a<,  - 

eouilibrium  voltage  distribution. 


For  any  three  nodes,  i,j,k  of  a  leaky  RC  mesh, 

«ii\j  -  'Sci'^tJ 

where  the  resistances  in  (4)  ace  elements  of  the 
resistance  matrix  R  in  (1). 


The  proof  given  in  (2]  for  non-leaky  RC  meshes 
generalizes  to  leaky  RC  meshes  in  a  straightforward 
way  (for  details,  see  also  [7]). 


The  step  response  of  a  leaky  RC  mesh  is  com¬ 
pletely  monotone,  i.e.,  for  any  node  J  we  have; 

V  (c)  -  0,  Vc  -  0;  if  V  >  V 

■'  ■'eq  ■'o  (3) 

’  f,\  ^  n  IT  >  0;  if  V.  <  V. 

V. (c)  -  0,  Vt  -  i  1 

J  -“eq  ■'o 

The  proof  is  in  (7  ]  . 

Fact  3 

For  any  two  nodes  i,  k  of  a  leaky  RC  mesh: 

On  an  "up"  ("down")  transition,  i.e., 

V  >  V  (v  <  V  ) ,  we  have 

^iiPk  -  \i  pi  - 

_  eq  J  L 

“SclPk  -  "Hck  Pi  “ 


where  the  last  equality  follows  upon  substituting 
(3)  into  the  integral  and  evaluating.  Using  Fact 
3  in  (9)  yields  the  state  constraints: 


Tj^SlCt)  5  f  Jr)  ^  Tpgjr) 


^R,  ■  i\k=k  ^p  "  S^’xkSc- 

The  initial  condition  or.  g,  (t)  is  obviously 

8^(0)  -  1  (11) 

The  initial  condition  on  f^(c)  is: 

>  *  iFPikS-\  ■  *0,  ^^2) 


Mote  that  in  the  non-leaky  ease,  .x.  •  1  Vk,  and 
so  Tjj  reduces  to  (which  agrees  with  its 

definition  in  earlier  works  [1],  [2]). 

Hence,  the  optimal  control  problem  of  maxim¬ 
izing  (minimizing)  the  time  for  g,  (t)  to  reach  a 
"target"  value  of  0<gjl  subject  to  the  state  con¬ 
straints  (10),  initial  conditions  (11)  and  (12), 
and  dynamics: 

[fjc]  fo  -iipjc;]  roi 


0  gi(t) 


1  u(t): 


u(t)  -  0 


is  algebraically  equivalent  to  the  problem  ia  (61 
and  results  in  upper  (lower)  bounds  on  g, (c) 
(corresponding  to  lower  (upper)  bounds  on  v.(t)l. 
The  exact  expressions  are  omitted  for  the  sake  of 
brevity,  and  they  agree  with  prior  results  ia  (11 
and  (21  for  non-leaky  meshes.  The  complete  bound 
expressions,  along  with  a  detailed  derivation  can 
be  found  in  (  7  ] . 

The  fundamental  reason  why  the  bounding  ap¬ 
proach  presented  in  this  section  carries  over  so 
successfully  from  non-leaky  to  leaky  meshes  is 
that  in  each  case,  the  dtmamics  of  the  circuit  is 
governed  by  a  (negative)  "M-macri;{".  [3] 


(7) 


.  <• 
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IV. 


Extra  ConrouEaclonal  Reaulremencs 
The  formulas  for  the  clme  conscancs  T. 


•'1  1 

and  Tp  indicace  that  the  iaformation  required  in 

order  to  compute  bounds  for  leaky  RC  meshes  is 
as  follows :  . 

1.)  The  i^  row  (or  column)  of  Che  resis¬ 
tance  matrix  R. 


2.) 


The  diagonal  elements  of  the  resistance 
matrix  R. 


3.)  The  vector  x  discussed  In  section  II. 


If,  as  will  be  the  case  In  our  model,  Che  RC  net¬ 
work  has  only  one  capacitor  node  (node  #1)  connec¬ 
ted  CO  the  driving  voltage  source  through  a  single 


resistor  (R.),  then  it  can  be  shown  [7]  chat 
x^  «  R,_,/R^^Vk.  So,  in  fact,  item  3)  above  is 


Rl'  S 

really  equivalent  to  the  first  row(or  column)  of 
Che  resistance  matrix  R. 


Renee,  the  computational  problem  consists  of 
finding  certain  elements  of  Che  resistance  matrix 
R.  Onforcunacely,  the  R  matrix  is  "global”  in 
character  and  is  difficult  to  compute  if  there  are 
resistor  loops  or  resistive  paths  to  ground  (as 
opposed  CO  the  case  of  non-Ieakv  RC  trees  in  [1], 
where  the  desired  elements  of  R  could  be  deter¬ 
mined  very  easily) .  ~ 

However,  the  conductance  matrix  G  ■  R  is 


"local"  in  character,  sparse,  and  can  be  deter¬ 
mined  virtually  by  inspection  even  when  there  are 
resistor  loops  or  grounded  resistors.  One  ap¬ 
proach,  then,  is  CO  determine  G  directly  and  in¬ 
vert  it  to  find  R.  An  efficient  numerical  al¬ 


gorithm  CO  do  this  is  Che  iterative  "conjugate 
gradient"  met.hod  (9:  chapter  101  which  exploits 
the  fact  Chat  G  is  sparse,  symmetric,  and  positive- 
definite.  This  algorithm  would  have  to  be  applied 
•N  times  CO  solve  the  linear  systems: 


G(R)^ 


5i 


1, 


(14) 


.Alternate  approaches  to  computing  resistance 
matrices  have  been  explored  by  Brock  [10].  One 
such  algorithm  is  recursive  and  involves  removing 
resistors  one  at  a  time  until  Che  original  resis¬ 
tor  subnetwork  is  reduced  to  a  non-leaky  tree.  The 
efficiency  of  this  algorithm  depends  on  the  num¬ 
ber  of  resistors  chat  must  be  removed.  Another 
such  algorithm  is  direct  and  involves  a  fixed  cost 
of  enumerating  all  possible  spanning  trees  of  the 
original  resistor  subnetwork.  Then,  only  the  ele¬ 
ments  of  R  that  are  desired  (which  is  an  advantage 


in  our  application)  are  computed  at  a  smaller 
additional  cost  per  element. 


V.  Modelling  of  Cates 


.Accurate  modelling  of  ECl.  gates  is  a  crucial 
prerequisite  to  applying  the  theory'  in  this  paper 
to  Che  timing  analysis  of  real  bipolar  digital 
circuits.  Both  the  sourcing  and  loading  effects 


ai"- 


of  Che  gates  must  be  considered.  As  of  this 
writing,  Che  sourcing  effects  have  not  yet  been 
modelled,  but  the  loading  etfeccs  have.  V2e  pre¬ 
sent  here  the  results  of  our  load  modelling,  and 
a  strategy  for  source  modelling.  The  gate  model 
we  are  considering  is  shown  below  in  Fig  3. 


, _ _ I 


Fig.  3:  Proposed  Model  of  ECI.  Gate 


He  have  determined  load  Impedance  parameters 
of  R^  -  120kil,  “■  *  .065pF,  and 

-  13UA  which  match  our  particular  SPICE  model 

of  an  ECL  gate  very  closely.  We  tested  this  load 
model  by  driving  with  the  "real"  SPICE  gate  over 
a  wide  range  of  interconnect  runs  (lumped  RC  lines 
from  0  CO  500  mils;  r  •  400/100  mils;  c  ■  .683pF/ 
100  mils)  and  fanout  (1  to  10  loads  in  parallel). 
Deviation  from  the  true  delay  (measured  from  the 
source  gate  input  to  Che  load  gate  input)  was  less 
than  5%,  the  worst  case  being  high-fanout  with 
short  interconnect.  In  most  cases,  especially 
with  lower  fanout,  errors  were  substantially  less 
chan  52.  While  not  explicitly  mentioned  in  the 
previous  sections,  the  current  source  I,  ,  since 

l£, 


its  value  never  changes ,  does  not  complicate  the 
bounding  theory.  It  merely  provides  a  necessary 
dc  offset,  since  our  ECL  gates  swing  between 
-1.55  volts  (low)  and  -1.05  volts  (high). 

The  source  modelling  is  somewhat  more  com¬ 
plicated.  First,  it  is  clear  that  two  different 
source  models  are  needed  since  there  is  a  defi¬ 
nite  asymmetry  between  rising  and  falling  trans¬ 
itions.  Our  SPICE  model  of  an  ECL  gate  can  source 
more  current  chan  it  can  sink,  so  delays  are 
greater  and  waveform  slew  races  are  lower  for  a 
falling  transition.  Second,  it  is  clear  chat  a 
(nonlinear)  mapping  fromv^^(.)  to 

needed.  We  expect  the  mapping  h  to  be  close  to 


the  following  simple  form:  If  v,  (.)  is  charact¬ 
erized  by  a  starting  time  of  t~0  and  a  slew  race 


T.  ,  then  h(v_,  (.))  is  characterized  by  a  start- 


in 


ing  time  of  c  •  T^j  and  a  slew  race  of  T^ 


i .  e . , 


out 


h  will  have  both  a  pure  delay  effect  and  a  simple 
waveshape  effect.  Finally,  we  nece  that  the  re¬ 
sponse  CO  a  non-steo  driving  waveform,  such  as 


h(v  (.)),  can  still  be  bounded  in  a  manner  con- 
in 


sistenc  with  the  theory  presented  in  this  paper. 
Since  the  leaky  RC  mesh  being  driven  is  linear , 
bounding  the  step  response  is  equivalent  to  bound¬ 
ing  the  resoonse  to  an  arbitrarv  non-steo  inouc 
[II. (7|. 
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